OpenRouter collaborated with a16z to publish the State of AI - an empirical report on how LLMs have been used on OpenRouter.
The analysis covers more than 100 trillion tokens across hundreds of models and 3+ million users (excluding 3rd party) from the last year.
The highlights:
1. One finding: OpenRouter observes a Cinderella "Glass Slipper" effect for new models.
Early users of a new LLM either churn quickly or become part of a foundational cohort with much higher retention than everyone else. These early adopters can "lead" the rest of the market.
2. Open vs Closed Weights:
By late 2025, open-weight models (abbreviated as OSS below) reached ~⅓ of usage, sustained beyond launch spikes, but have plateaued in Q4.
3. Chinese models: grew from ~1% to around 30% in some weeks. Release velocity + quality make the market lively.
If you want a single picture of the modern stack:
- Closed models = high-value workloads
- Open models = high-volume workloads
In practice, many teams use both.
OSS isn't "just for tinkering" - it is extremely popular in two areas:
• Roleplay / creative dialogue: >50% of OSS usage
• Programming assistance: ~15-20%.
4. Now the significant platform shift: agentic inference
OpenRouter tracked it via:
- reasoning model adoption
- tool calling
- prompt/completion “shape” (sequence lengths).
5. Reasoning models went from “negligible” to more than 50% of tokens in 2025. A full paradigm shift.
6. Languages: English dominates with more than 80% of tokens, but the tail is real - Chinese, Russian, Spanish, etc.
7. Economics: price matters, but less than you think. On the cost-vs-usage map, the trendline is nearly flat: reducing cost by 10% correlates with only ~0.5-0.7% more usage.
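As a back-of-the-envelope check on that near-flat trendline: the report's numbers imply a price elasticity of roughly 0.05-0.07. The function below is illustrative, not from the report:

```python
def expected_usage_gain(price_cut_pct: float, elasticity: float = 0.06) -> float:
    """Approximate extra usage (%) from a given price cut (%), assuming a
    constant and very small price elasticity of demand."""
    return elasticity * price_cut_pct

# A 10% price cut buys only ~0.6% more usage at elasticity 0.06,
# squarely inside the report's ~0.5-0.7% range.
gain = expected_usage_gain(10.0)
```

In other words, halving your price would, on this trendline, move usage by only a few percent: capability, not cost, is what drives adoption.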
OpenRouter
State of AI 2025: 100T Token LLM Usage Study | OpenRouter
Read OpenRouter's comprehensive 2025 State of AI report — an empirical 100 trillion token study of real LLM usage, model trends, and developer ecosystem insights.
Meta published a new paper on the path to safer superintelligence: co-improvement.
Everyone is focused on self-improving AI, but:
1) we don't know how to do it yet, and
2) it might be misaligned with humans.
Co-improvement: instead, build AI that collaborates with us to solve AI faster, and to help fix the alignment problem together.
arXiv.org
AI & Human Co-Improvement for Safer Co-Superintelligence
Self-improvement is a goal currently exciting the field of AI, but is fraught with danger, and may take time to fully achieve. We advocate that a more achievable and better goal for humanity is to...
Nvidia introduced CUDA 13.1. It is the biggest expansion of CUDA since it launched in 2006.
It introduces CUDA Tile, a new way to program GPUs that makes powerful AI and accelerated computing easier for more developers to use.
GitHub
GitHub - NVIDIA/cutile-python: cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
cuTile is a programming model for writing parallel kernels for NVIDIA GPUs - NVIDIA/cutile-python
Essential AI introduced their first open models, Rnj-1 base and instruct 8B parameter models.
Rnj-1 is the culmination of 10 months of hard work by a phenomenal team, dedicated to advancing American SOTA OSS AI.
Lots of wins with Rnj-1.
1. SWE-Bench performance close to GPT-4o.
2. Tool use outperforming all comparable open source models.
3. Mathematical reasoning (AIME’25) nearly at par with GPT OSS MoE 20B.
Could a transformer's attention be 99% sparser without losing its smarts?
New research from MPI-IS, Oxford, and ETH Zürich shows it can.
A simple post-training method strips away redundant connections, revealing a cleaner, more interpretable circuit.
This suggests much of the computation we rely on is just noise.
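The paper's exact constrained-loss regulariser isn't reproduced here, but the generic idea (add a penalty that pushes each attention row toward a near-one-hot, i.e. sparse, distribution during post-training) can be sketched with an entropy term. All names and the penalty form below are illustrative:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def entropy(ps):
    return -sum(p * math.log(p) for p in ps if p > 0)

def attn_row_with_sparsity_penalty(scores, lam=0.01):
    """One attention row: softmax over raw scores, plus an entropy
    penalty. Adding lam * H(p) to the task loss pushes the row toward
    a near-one-hot (sparse) distribution during post-training."""
    probs = softmax(scores)
    return probs, lam * entropy(probs)

# A peaked row already has low entropy, so it pays a smaller penalty
# than a uniform (dense) row.
probs, pen = attn_row_with_sparsity_penalty([2.0, 0.1, -1.0, -1.0])
uniform_pen = attn_row_with_sparsity_penalty([0.0, 0.0, 0.0, 0.0])[1]
```

Training under such a pressure, then hard-pruning the weights that went to ~zero, is what leaves the cleaner circuit behind.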
arXiv.org
Sparse Attention Post-Training for Mechanistic Interpretability
We introduce a simple post-training method that makes transformer attention sparse without sacrificing performance. Applying a flexible sparsity regularisation under a constrained-loss objective,...
the-state-of-enterprise-ai_2025-report.pdf
9.7 MB
OpenAI released their first State of Enterprise AI report.
OpenAI's data from 1M+ business customers reveals explosive growth—and a widening divide between leaders and laggards.
ChatGPT Enterprise seats: 9x growth year-over-year, serving 7M workplace seats. Message volume: 8x increase. API reasoning token consumption per organization: 320x growth. Custom GPTs/Projects usage: 19x increase year-to-date, now processing 20% of all Enterprise messages.
Over 9,000 organizations processed 10+ billion tokens; nearly 200 exceeded 1 trillion. BBVA regularly uses 4,000+ GPTs—AI has become core infrastructure, not an experimental tool.
Measurable Impact:
Productivity: 75% of workers report improved speed or quality. Average time savings: 40-60 minutes per active day. Workers in data science, engineering, and communications save 60-80 minutes.
Business outcomes:
Intercom's Fin Voice: 53% of calls resolved end-to-end, 40% faster resolution when human agents needed
Lowe's Mylow: 2x online conversion rate, +200 basis points customer satisfaction
Indeed: 20% more applications, 13% higher downstream success (interviews/hires)
BBVA: 9,000+ queries automated annually, equivalent of 3 FTEs redeployed
Task expansion: 75% of workers complete tasks they previously couldn't. Coding messages outside engineering/IT/research grew 36% in six months. AI is redistributing technical capabilities across organizations.
Industry & Geography
Median sector growth: 6x YoY. Technology leads at 11x, healthcare 8x, manufacturing 7x.
International surge: Australia (187%), Brazil (161%), Netherlands (153%), France (146%) lead business customer growth. International API customers: 70% growth in six months. Japan has the most corporate API customers outside the U.S.
The Widening Gap:
Frontier workers (95th percentile) send 6x more messages than median. Among data analysts, frontier users leverage analysis tools 16x more. The gap is widest in coding (17x), writing (11x), and analysis (10x).
Frontier firms generate 2x more messages per seat and 7x more messages to GPTs than median enterprises.
The underutilization problem: Among monthly active users, 19% never used data analysis, 14% never used reasoning, 12% never used search. Users engaging with ~7 task types save 5x more time than those using ~4 types.
What Leaders Do Differently:
Enable deep system integration with secure data access
Standardize workflows through Custom GPTs and shared solutions
Secure executive sponsorship with clear mandates
Codify institutional knowledge into machine-readable formats
Combine centralized governance with distributed enablement
Critical barrier: ~25% of enterprises still haven't enabled data connectors—while leaders make this their first step.
Qwen introduced Soft Adaptive Policy Optimization (SAPO) — a smooth, stable, and highly effective RL method for training LLMs
SAPO replaces hard boundaries with a continuous, temperature‑controlled gate that:
• Smooth trust-region behavior → no abrupt gradient drop
• Sequence-level coherence → align sequence‑level behavior
• Token-level adaptivity → preserves useful gradients & boosts sample efficiency
• Asymmetric temperatures → significantly improved stability, esp. in MoE models
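SAPO's actual gate isn't spelled out above, but the contrast with PPO's hard clip can be sketched generically: replace the clip with a temperature-controlled sigmoid window, so importance weights fade out smoothly instead of having their gradient cut to zero. Function names and the eps/tau values here are illustrative, not the paper's:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def hard_clip_weight(ratio, eps=0.2):
    """PPO-style hard clip: the gradient vanishes abruptly outside the band."""
    return max(min(ratio, 1.0 + eps), 1.0 - eps)

def soft_gate_weight(ratio, eps=0.2, tau=0.05):
    """Smooth, temperature-controlled gate (illustrative, not SAPO's exact
    form): off-policy ratios are down-weighted gradually instead of cut
    off. Lower tau makes the gate sharper; using different taus on the
    two sides is one way to get the asymmetric behavior described above."""
    upper = sigmoid((1.0 + eps - ratio) / tau)   # fades out above 1 + eps
    lower = sigmoid((ratio - (1.0 - eps)) / tau)  # fades out below 1 - eps
    return ratio * upper * lower

w_in = soft_gate_weight(1.0)    # inside the trust region: ~unchanged
w_out = soft_gate_weight(1.6)   # far outside: smoothly suppressed
```

Inside the trust region the two objectives behave alike; outside it, the soft gate decays continuously rather than saturating, which is what keeps gradients informative over long RL runs.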
What does this mean in practice?
1. Longer stable RL runs
2. Higher Pass@1
3. Stronger performance on Qwen3‑VL across math, coding & multimodal tasks
SAPO offers a more scalable and reliable foundation for RL-tuning large language & multimodal models.
Paper.
arXiv.org
Soft Adaptive Policy Optimization
Reinforcement learning (RL) plays an increasingly important role in enhancing the reasoning capabilities of large language models (LLMs), yet stable and performant policy optimization remains...
This research introduces VisPlay, a self-evolving framework where a single vision-language model splits into a "Questioner" and a "Reasoner" to generate its own training data.
It autonomously improves reasoning and reduces hallucinations across major benchmarks, pointing toward scalable, self-improving AI.
GitHub.
arXiv.org
VisPlay: Self-Evolving Vision-Language Models from Images
Reinforcement learning (RL) provides a principled framework for improving Vision-Language Models (VLMs) on complex reasoning tasks. However, existing RL approaches often rely on human-annotated...
SOTA open-source vibe coding from your home: Mistral introduced the Devstral 2 coding model family.
Two sizes, both open source.
Also, meet Mistral Vibe, a native CLI, enabling end-to-end automation.
Mistral Vibe CLI is an open-source command-line coding assistant powered by Devstral.
It explores, modifies, and executes changes across your codebase using natural language. Also under Apache 2.0.
Install via: uv tool install mistral-vibe
mistral.ai
Introducing: Devstral 2 and Mistral Vibe CLI. | Mistral AI
State-of-the-art, open-source agentic coding models and CLI agent.
Meta released Ax 1.0: an open-source platform for adaptive experimentation at scale.
Ax uses ML to automate complex, resource-intensive experiments, enabling efficient optimization for AI, infrastructure, and hardware.
Engineering at Meta
Efficient Optimization With Ax, an Open Platform for Adaptive Experimentation
We’ve released Ax 1.0, an open-source platform that uses machine learning to automatically guide complex, resource-intensive experimentation. Ax is used at scale across Meta to improve AI models, t…
Anthropic shipped three new updates for Claude Agent SDK to make it easier to build custom agents:
- Support for 1M context windows
- Sandboxing
- V2 of the TypeScript interface
GitHub.
Claude API Docs
Agent SDK reference - TypeScript
Complete API reference for the TypeScript Agent SDK, including all functions, types, and interfaces.
Google released the FACTS Benchmark Suite
It’s the industry’s first comprehensive test evaluating LLM factuality across four dimensions: internal model knowledge, web search, grounding, and multimodal inputs.
Google DeepMind
FACTS Benchmark Suite: a new way to systematically evaluate LLMs factuality
The FACTS Benchmark Suite provides a systematic evaluation of Large Language Models (LLMs) factuality across three areas: Parametric, Search, and Multimodal reasoning.
Travis Beals, a Google executive working on the orbital data-center effort, said it would take 10,000 satellites to recreate the compute capacity of a gigawatt data center, assuming 100-kilowatt satellites.
The Wall Street Journal
Exclusive | Bezos and Musk Race to Bring Data Centers to Space
Jeff Bezos and Elon Musk are racing to take the trillion-dollar data-center boom into orbit.
NVIDIA presents Alpamayo-R1
It's a vision-language-action model that uses "Chain of Causation" reasoning to plan.
It cuts off-road events by 35% and improves decision-making in complex scenarios, showing a promising path to more capable autonomy.
arXiv.org
Alpamayo-R1: Bridging Reasoning and Action Prediction for...
End-to-end architectures trained via imitation learning have advanced autonomous driving by scaling model size and data, yet performance remains brittle in safety-critical long-tail scenarios...
Google released the Gemini Deep Research agent for developers.
It can create a plan, spot gaps, and autonomously navigate the web to produce detailed reports.
Built on Gemini 3 Pro, it was trained using multi-step reinforcement learning to increase accuracy and reduce hallucinations.
It handles massive context – analyzing your uploaded docs alongside the web – and provides citations so you can verify every claim.
Deep Research is the first agent released on the new Interactions API – offering a single endpoint for agentic workflows.
Google
Build with Gemini Deep Research
We have reimagined Gemini Deep Research to be more powerful than ever, now accessible to developers via the new Interactions API.
OpenAI shipped a new model. GPT-5.2 showcases OpenAI's incredible post-training stack in action: significant gains in knowledge work (think building a financial model), long-context capability, and coding.
GPT-5.2 likely involved additional mid-training to refresh the cutoff date, plus significant amounts of RL.
One catch: OpenAI raised pricing 40%. Is it worth it?
SWE-Bench Pro results offer an interesting perspective. GPT-5.2 is able to reach higher scores at comparable cost to 5.1 Codex Max, while also continuing to push the capability ceiling.
This price hike will directly increase OpenAI's margins.
We saw a similar dynamic with Claude models, whereby Opus 4.5 was able to achieve comparable scores to Sonnet 4.5 at much lower cost.
This is due to models becoming increasingly token-efficient, requiring less thinking to get more done.
OpenAI
Introducing GPT-5.2
GPT-5.2 is our most advanced frontier model for everyday professional work, with state-of-the-art reasoning, long-context understanding, coding, and vision. Use it in ChatGPT and the OpenAI API to power faster, more reliable agentic workflows.
Apple briefly posted then quickly pulled an arXiv paper, but the v1 snapshot is wild.
The team reveals RLAX, a scalable RL framework on TPUs.
It's built with a parameter server design where a master trainer pushes weights and massive inference fleets pull them to generate rollouts.
With new curation and alignment tricks and preemption-friendly engineering, RLAX boosts QwQ-32B pass@8 by 12.8 percent in only 12h48m on 1024 v5p TPUs.
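The parameter-server pattern described above is a standard design; here is a minimal single-process sketch (not RLAX's actual code) of a trainer pushing versioned weights while rollout workers pull the latest copy:

```python
import threading

class ParameterServer:
    """Toy parameter server: the master trainer pushes versioned weights;
    inference workers pull the newest version before generating rollouts.
    Versioning is what lets preempted or slow workers simply re-pull and
    continue, keeping training and generation decoupled."""

    def __init__(self):
        self._lock = threading.Lock()
        self._weights = None
        self._version = 0

    def push(self, weights):             # called by the master trainer
        with self._lock:
            self._weights = weights
            self._version += 1

    def pull(self):                      # called by each inference worker
        with self._lock:
            return self._version, self._weights

ps = ParameterServer()
ps.push({"w": [0.1, 0.2]})               # trainer publishes an update
version, weights = ps.pull()             # a rollout worker fetches it
```

At RLAX's scale the lock becomes sharded RPC endpoints and the dict becomes model shards, but the push/pull contract is the same.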
First comprehensive framework for how AI agents actually improve through adaptation.
Researchers from many universities surveyed the rapidly expanding landscape of agentic AI adaptation.
What they found: a fragmented field with no unified understanding of how agents learn to use tools, when to adapt the agent versus the tool, and which strategies work for which scenarios.
These are all important for building production-ready AI agents.
Adaptation in agentic AI follows four distinct paradigms that most practitioners conflate or ignore entirely.
The framework organizes all adaptation strategies into two dimensions.
- Agent Adaptation (A1, A2): modifying the agent's parameters, representations, or policies.
- Tool Adaptation (T1, T2): optimizing external components like retrievers, planners, and memory modules while keeping the agent frozen.
GitHub
Awesome-Adaptation-of-Agentic-AI/paper.pdf at main · pat-jj/Awesome-Adaptation-of-Agentic-AI
Repo for "Adaptation of Agentic AI". Contribute to pat-jj/Awesome-Adaptation-of-Agentic-AI development by creating an account on GitHub.
Diffusion LLMs are the new frontier? InclusionAI has released LLaDA 2.0—the first diffusion model to scale to 100B params, matching frontier LLMs while achieving 2× faster inference.
LLaDA is 2.3x faster on average. We see unique high-TPF advantages in Coding via parallel decoding.
The Challenge: AR models had a 3-year head start.
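Parallel decoding is what buys the speedup: instead of fixing one token per step left to right, a diffusion decoder commits several high-confidence tokens per step. A toy sketch of the loop (not LLaDA's actual algorithm; `toy_scorer` is a stand-in for the model):

```python
MASK = "<mask>"

def diffusion_decode(score_fn, length, steps):
    """Toy diffusion-style decoder: start fully masked and, at each step,
    commit the most confident tokens in parallel. An AR model would need
    `length` steps; here we finish in `steps` passes."""
    seq = [MASK] * length
    per_step = max(1, length // steps)
    for _ in range(steps):
        masked = [i for i, t in enumerate(seq) if t == MASK]
        if not masked:
            break
        cands = {i: score_fn(seq, i) for i in masked}   # (token, confidence)
        best = sorted(masked, key=lambda i: cands[i][1], reverse=True)[:per_step]
        for i in best:
            seq[i] = cands[i][0]
    return seq

def toy_scorer(seq, i):
    # Stand-in for the model: propose a token with position-based confidence.
    return f"t{i}", float(i)

out = diffusion_decode(toy_scorer, length=4, steps=2)  # 2 tokens per step
```

The "high-TPF" (tokens per forward pass) advantage in coding comes from exactly this: many positions in structured code are predictable enough to commit in the same pass.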
GitHub.
GitHub
GitHub - inclusionAI/dFactory: Easy and Efficient dLLM Fine-Tuning
Easy and Efficient dLLM Fine-Tuning. Contribute to inclusionAI/dFactory development by creating an account on GitHub.
NVIDIA launched the open Nemotron 3 model family, starting with Nano (30B-3A), which pushes the frontier of accuracy and inference efficiency with a novel hybrid SSM Mixture of Experts architecture.
Super and Ultra are coming in the next few months.
Nemotron 3 Super (~4X bigger than Nano) and Ultra (~16X bigger than Nano) are pretrained using NVFP4, a new "Latent Mixture of Experts" architecture that allows us to use 4X more experts for the same inference cost, and Multi-Token Prediction.
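NVIDIA hasn't published the "Latent MoE" details here; for orientation, this is the standard top-k MoE routing that any such design builds on (a generic sketch, not Nemotron's architecture). Only k experts run per token, which is why expert count and per-token inference cost are decoupled:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(x, experts, router_logits, k=2):
    """Generic top-k Mixture-of-Experts forward pass: a router scores all
    experts, but only the k highest-gated ones actually run, and their
    outputs are combined with renormalized gate weights."""
    gates = softmax(router_logits)
    top = sorted(range(len(experts)), key=gates.__getitem__, reverse=True)[:k]
    norm = sum(gates[i] for i in top)
    return sum(gates[i] / norm * experts[i](x) for i in top)

# Four toy "experts" that just scale their input by 1x..4x.
experts = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
y = moe_forward(1.0, experts, router_logits=[0.0, 0.0, 1.0, 2.0], k=2)
```

In this frame, NVIDIA's claim of "4X more experts for the same inference cost" amounts to growing `len(experts)` while holding k (and thus the compute per token) fixed.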
a16z released 17 crypto predictions for 2026. Most are obvious. A few are not.
The ones worth paying attention to:
1. Privacy becomes the strongest moat
Bridging tokens is easy. Bridging secrets is hard. Users on private chains are less likely to leave.
Winner-take-most dynamics emerge.
2. Know Your Agent (KYA)
Non-human identities outnumber human employees 96-to-1 in financial services.
The agent economy's bottleneck is identity.
3. AI agents are taxing the open web
They extract value from ad-supported sites while bypassing revenue streams.
The web needs real-time, usage-based compensation or content creation collapses.
a16z crypto
17 things we're excited about for crypto in 2026 - a16z crypto