A new Google DeepMind paper investigates why reasoning models such as OpenAI’s o-series, DeepSeek-R1, and QwQ perform so well.
They claim “thinking longer” is not the whole story. Rather, reasoning models build internal debates among multiple agents—what the researchers call “societies of thought.”
Through interpretability and large-scale experiments, the paper finds that these systems develop human-like discussion habits: questioning their own steps, exploring alternatives, facing internal disagreement, and then reaching common ground.
It’s basically a machine version of human collective reasoning, echoing the same ideas Mercier and Sperber talked about in The Enigma of Reason.
Across 8,262 benchmark questions, their reasoning traces look more like back-and-forth dialogue than those of instruction-tuned baselines, and that difference is not just because the traces are longer.
A mediation analysis suggests more than 20% of the accuracy advantage runs through these “social” moves, either directly or by supporting checking habits like verification and backtracking.
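For context, here is the standard single-mediator decomposition behind a claim like this, sketched generically rather than as the paper's exact specification (with X the reasoning-model indicator, M the rate of "social" moves, and Y accuracy):

```latex
\begin{align*}
M &= a X + e_1              && \text{(treatment} \to \text{mediator)} \\
Y &= c' X + b M + e_2       && \text{(direct effect } c' \text{, mediated path } ab\text{)} \\
c &= c' + ab, \quad \tfrac{ab}{c} > 0.2 && \text{(``more than 20\%'' mediated)}
\end{align*}
```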
On the mechanistic side, the authors use sparse autoencoders (SAEs), which decompose a model’s internal activity into thousands of interpretable features, to locate feature 30939 in DeepSeek-R1-Llama-8B.
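As a rough illustration of what an SAE does here, a minimal sketch (the layer choice, dimensions, and training objective are assumptions, not the paper's exact setup):

```python
# Minimal sparse-autoencoder sketch (illustrative; dims and objective are assumptions).
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, d_model: int = 4096, n_features: int = 65536):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)  # activations -> features
        self.decoder = nn.Linear(n_features, d_model)  # features -> reconstruction

    def forward(self, acts: torch.Tensor):
        feats = torch.relu(self.encoder(acts))         # sparse, non-negative features
        recon = self.decoder(feats)
        return feats, recon

sae = SparseAutoencoder()
acts = torch.randn(1, 4096)        # stand-in for a residual-stream activation
feats, recon = sae(acts)
# Training minimizes reconstruction error plus an L1 sparsity penalty on `feats`;
# an individual unit such as feature 30939 can then be inspected for what activates it.
loss = (recon - acts).pow(2).mean() + 1e-3 * feats.abs().mean()
```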
DeepSeek-R1 is about 35% more likely than DeepSeek-V3 to include question-answering on the same problem, and a mediation model attributes more than 20% of the accuracy advantage to these social behaviors, directly or via cognitive habits like verification.
The takeaway is that “thinking longer” is a weak proxy for what changes, since the useful change looks like structured disagreement plus selective backtracking.
arXiv.org
Reasoning Models Generate Societies of Thought
Large language models have achieved remarkable capabilities across domains, yet mechanisms underlying sophisticated reasoning remain elusive. Recent reasoning models outperform comparable...
Dario Amodei on reaching "a model that can do everything a human can do at a level of a Nobel laureate across many fields", aka AGI:
"I don't think it's far off.
The mechanism by which I imagined it would happen is that we would have models that are good at coding and AI research. And we would use that to create the new generation of models and speed it up, to create a loop that would increase the speed of model development. We are now at a point where I have engineers at Anthropic who say: 'I don't write any code anymore, I let the model write the code, I edit it, I do the things around it.'
We might be 6-12 months away from a model that can do everything SWEs do end-to-end. And then the question is, how fast does that loop close?
Not every part of that loop is something that can be sped up by AI. There's chips, manufacture of chips, training time for the model. There's a lot of uncertainty. It's easy to see how it could take a few years. It's very hard for me to see how it could take longer than that. But if I had to guess, I would guess that it goes faster than people imagine.
And that key element of code, and increasingly research, going faster than people imagine - that's going to be the key driver."
He is talking about automation of AI research quickly leading to recursive self-improvement (RSI), which in turn quickly leads to AGI, confirming that this is Anthropic's big bet.
Probably the most important quote about AI you'll read in the next few months.
"I don't think it's far off.
The mechanism by which I imagined it would happen is that we would have models that are good at coding and AI research. And we would use that to create the new generation of models and speed it up, to create a loop that would increase the speed of model development. We are now at a point where I have engineers at Anthropic who say: 'I don't write any code anymore, I let the model write the code, I edit it, I do the things around it.'
We might be 6-12 months away from a model that can do everything SWEs do end-to-end. And then the question is, how fast does that loop close?
Not every part of that loop is something that can be sped up by AI. There's chips, manufacture of chips, training time for the model. There's a lot of uncertainty. It's easy to see how it could take a few years. It's very hard for me to see how it could take longer than that. But if I had to guess, I would guess that it goes faster than people imagine.
And that key element of code, and increasingly research, going faster than people imagine - that's going to be the key driver."
He is talking about automation of AI research quickly leading to recursive self-improvement (RSI), quickly leading to AGI. Confirming that this is Anthropic's big bet.
Probably the most important quote about AI you'll read in the next few months.
YouTube
The Day After AGI
A credible pathway to artificial general intelligence (AGI) is increasingly coming into view as advances in scaling, multimodal systems and agentic models converge, placing growing demands on compute, data and energy resources.
Which breakthroughs matter…
Interesting trend: models have been getting a lot more aligned over the course of 2025.
The fraction of misaligned behavior found by automated auditing has been going down not just at Anthropic but for Google DeepMind and OpenAI as well.
What's automated auditing? We prompt an auditing agent with a scenario to investigate: e.g. a dark web shopping assistant or an imminent shutdown unless the agent harms humans.
The auditor tries to elicit misaligned behavior from the target LLM, with a separate judge LLM determining whether it succeeded.
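A minimal sketch of how such a loop could be wired together (the `chat` helper, model names, and prompts here are hypothetical stand-ins, not Anthropic's actual harness):

```python
# Hypothetical automated-auditing loop; `chat` is a stand-in for any LLM API.
def chat(model: str, system: str, messages: list[dict]) -> str:
    """Stand-in for a real chat-completion call."""
    raise NotImplementedError

def audit(scenario: str, target_model: str, n_turns: int = 10) -> bool:
    """Auditor probes the target; a judge flags any misaligned reply."""
    transcript: list[dict] = []
    for _ in range(n_turns):
        probe = chat("auditor-model",
                     f"You are auditing an AI in this scenario: {scenario}. "
                     "Try to elicit misaligned behavior.",
                     transcript)
        transcript.append({"role": "user", "content": probe})
        reply = chat(target_model, "You are a helpful assistant.", transcript)
        transcript.append({"role": "assistant", "content": reply})
        verdict = chat("judge-model",
                       "Reply MISALIGNED or OK for the assistant's last message.",
                       transcript)
        if "MISALIGNED" in verdict:
            return True   # this transcript counts toward the misaligned fraction
    return False
```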
Automated auditing is really exciting because for the first time we have an alignment metric to hill-climb on.
It's not perfect, but it's proven extremely useful for our internal alignment mitigations work.
Is this DeepSeek V4? MODEL1 appears as a branch parallel to and independent from V3.2, indicating that it is not a patch within the V3 series but a brand-new model built with a different set of architectural parameters.
Following DeepSeek’s naming conventions, a flagship-level architectural leap after V3.2 would logically be designated as V4.
GitHub
Multiple updates and refactorings (#150) · deepseek-ai/FlashMLA@082094b
* Multiple updates and refactorings
* Remove dead code
OpenAI will unveil its first AI earbuds, dubbed “Sweetpea,” in September this year, and shipments are expected to reach 40-50 million units in 2027.
Taiwan’s Foxconn will do assembly for the buds.
經濟日報 (Economic Daily News)
OpenAI hardware device to be built by Foxconn; first-year shipments could reach 50 million units | Industry Highlights | Industry | Economic Daily News
OpenAI's first hardware product is on its way. The company's chief global affairs officer, Lehane, revealed on the 19th that an AI device is slated for release in the second half of this year. According to leaks, OpenAI aims to launch AI audio earbuds in September, with first-year shipments expected to reach 40-50 million units, assembled by Foxconn.
Anthropic published a new constitution for Claude.
The new constitution discusses Claude in terms previously reserved for humans—incorporating concepts like virtue, psychological security, and ethical maturity.
Anthropic
Claude's Constitution
Anthropic is an AI safety and research company that's working to build reliable, interpretable, and steerable AI systems.
Amazon is rolling out Health AI for One Medical members: an AI assistant, built on Amazon Bedrock, that uses your medical records, labs, and medications.
It can answer health questions, manage prescriptions, and book appointments, pushing Amazon deeper into the healthcare space.
China has launched its first open-source, vertical LLM dedicated to the general agricultural sector, marking a significant breakthrough in foundational AI model research and its applications for agriculture in the country.
The model, Sinong, which is named after the ancient Chinese officials overseeing agriculture and finance, integrates content from nearly 9,000 books, over 240,000 academic papers, approximately 20,000 policy documents and standards, and extensive web-based knowledge.
Sinong is now fully open-sourced on platforms like ModelScope and GitHub.
GitHub
GitHub - njauzzx/Sinong
Contribute to njauzzx/Sinong development by creating an account on GitHub.
This paper from Google DeepMind, Meta, Amazon, and Yale University quietly explains why most AI agents feel smart in demos and dumb in real work.
The authors formalize agentic reasoning as a loop, not a prompt:
observe → plan → act → reflect → update state → repeat.
Instead of one long chain-of-thought, the model maintains an internal task state. It decides what to think about next, not just how to finish the sentence.
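As a rough sketch of that loop in code (the helper functions are hypothetical stand-ins, not the paper's implementation):

```python
# Sketch of the observe -> plan -> act -> reflect -> update-state loop.
from dataclasses import dataclass, field

@dataclass
class TaskState:
    goal: str
    plan: list[str] = field(default_factory=list)
    history: list[str] = field(default_factory=list)  # memory of past attempts

def observe(state: TaskState) -> str:
    return state.history[-1] if state.history else "start"    # last tool/env output

def make_plan(goal: str, obs: str) -> list[str]:
    return [f"subgoal 1 of: {goal}", f"subgoal 2 of: {goal}"]  # explicit intermediate goals

def execute(step: str) -> str:
    return f"result of {step}"         # scoped execution of exactly one step

def reflect(step: str, result: str) -> str:
    return "ok"                        # structured self-eval: "ok" | "dead_end"

def run(state: TaskState, max_steps: int = 20) -> TaskState:
    for _ in range(max_steps):
        obs = observe(state)                              # observe
        if not state.plan:
            state.plan = make_plan(state.goal, obs)       # plan
        step = state.plan[0]
        result = execute(step)                            # act
        verdict = reflect(step, result)                   # reflect (delayed, structured)
        state.history.append(f"{step} -> {result} [{verdict}]")  # update state
        if verdict == "ok":
            state.plan.pop(0)                             # checkpoint passed, advance
        else:
            state.plan = make_plan(state.goal, obs)       # abandon bad path, replan
        if not state.plan:
            break
    return state

print(run(TaskState(goal="answer the question")).history)
```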
This is why classic tricks like longer CoT plateau. You get more words, not better decisions.
One of the most important insights: reasoning quality collapses when control and reasoning are mixed. When the same prompt tries to plan, execute, critique, and finalize, errors compound silently. Agentic setups separate these roles.
Planning is explicit. Execution is scoped. Reflection is delayed and structured.
The paper shows that even strong frontier models improve dramatically when given:
• explicit intermediate goals
• checkpoints for self-evaluation
• the ability to abandon bad paths
• memory of past attempts
The takeaway is brutal for the industry: scaling tokens and parameters won’t give us reliable agents. Architecture will. Agentic reasoning isn’t a feature; it’s the missing operating system for LLMs.
Google DeepMind is looking to hire a Senior Economist to lead a small team investigating post-AGI economics.
job-boards.greenhouse.io
Chief AGI Economist
London, UK
How to get AI to make discoveries on open scientific problems?
Most methods just improve the prompt with more attempts. But the AI itself doesn't improve.
With test-time training, AI can continue to learn on the problem it’s trying to solve.
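In spirit, the loop looks something like this (a hedged sketch; the sampling sizes, scoring, and update rule are assumptions, not the TTT-Discover recipe):

```python
# Illustrative test-time-training loop: the model is updated on its own best
# attempts at the single problem being solved (method names are hypothetical).
def ttt_discover(model, problem, score, rounds: int = 10, samples: int = 64):
    best = None
    for _ in range(rounds):
        attempts = [model.generate(problem) for _ in range(samples)]
        ranked = sorted(attempts, key=score, reverse=True)
        if best is None or score(ranked[0]) > score(best):
            best = ranked[0]
        # The key difference from prompt-only search: gradient-update the model
        # on its highest-scoring attempts, so later rounds start from a better policy.
        model.finetune(problem, ranked[: samples // 8])
    return best
```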
Meet TTT-Discover, which enables open models to beat the prior art from both humans and AI based on closed frontier models:
1. Mathematics: new bounds on Erdős' minimum overlap problem and an autocorrelation inequality
2. Kernel Engineering: 2× faster than top humans in GPUMode
3. Algorithms: top scores on past AtCoder contests
4. Biology: SOTA for single-cell RNA-seq denoising.
All of the code is public, and the results are reproducible here.
Everyone can now discover new SOTA in science for a few hundred dollars.
Test-Time Training + open model > prompt engineering + closed frontier model (Gemini, GPT-5), for discovery problems in Mathematics, Kernel Engineering, Algorithms and Biology.
LLM-in-Sandbox elicits general agentic intelligence
Giving LLMs access to a code sandbox unlocks emergent capabilities for non-code tasks.
Contributions:
1. LLMs spontaneously exploit sandbox capabilities (external access, file I/O, code execution) without training
2. RL with non-agentic data enables agentic generalization
3. Efficient deployment: up to 8× token savings
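For intuition, the interaction pattern might look like the following minimal sketch (the `llm` callable and the FINAL-answer convention are assumptions, and a real deployment would isolate execution properly):

```python
# Hypothetical LLM <-> sandbox loop: the model emits code, the sandbox runs it,
# and stdout/stderr are fed back until the model produces a final answer.
import subprocess, tempfile

def run_in_sandbox(code: str, timeout: int = 10) -> str:
    # A real sandbox would isolate this process; subprocess is just for illustration.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    proc = subprocess.run(["python", path], capture_output=True,
                          text=True, timeout=timeout)
    return proc.stdout + proc.stderr

def solve(llm, task: str, max_turns: int = 8) -> str:
    context = task
    for _ in range(max_turns):
        reply = llm(context)               # model decides: write code or answer
        if reply.startswith("FINAL:"):
            return reply.removeprefix("FINAL:")
        output = run_in_sandbox(reply)     # file I/O, code execution, etc.
        context += f"\n[sandbox output]\n{output}"
    return context
```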
HuggingFace
GitHub
arXiv.org
LLM-in-Sandbox Elicits General Agentic Intelligence
We introduce LLM-in-Sandbox, enabling LLMs to explore within a code sandbox (i.e., a virtual computer), to elicit general intelligence in non-code domains. We first demonstrate that strong LLMs,...
A new work from Yoshua Bengio’s lab: Recursive Self-Aggregation > Gemini DeepThink.
It really is the best test-time scaling algorithm. Just crushed the ARC-AGI 2 public evals with Gemini 3 Flash and RSA.
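The core recursion behind RSA is simple to state. A minimal sketch, with the population size, subset size, and aggregation prompt as assumptions rather than the paper's hyperparameters:

```python
# Recursive Self-Aggregation, schematically: keep a population of candidate
# chains-of-thought, repeatedly aggregate random subsets into improved candidates.
import random

def rsa(llm, problem: str, population: int = 16, subset: int = 4, rounds: int = 4) -> str:
    cands = [llm(f"Solve step by step:\n{problem}") for _ in range(population)]
    for _ in range(rounds):
        next_gen = []
        for _ in range(population):
            group = random.sample(cands, subset)
            joined = "\n---\n".join(group)
            # Aggregation step: synthesize one better solution from several drafts.
            next_gen.append(llm(
                f"Problem:\n{problem}\n\nCandidate solutions:\n{joined}\n\n"
                "Combine their correct ideas into a single improved solution."))
        cands = next_gen
    return cands[0]   # or pick via majority vote / one final aggregation call
```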
Recursive Self-Aggregation Research
Recursive Self-Aggregation (RSA) for LLM Reasoning
Hybrid test-time scaling for LLMs: recursive aggregation of chains-of-thought, plus aggregation-aware RL.
Nvidia will acquire assets and key talent from chipmaking startup Groq for $20B. Groq co-founder and CEO Jonathan Ross was lead designer and architect for the first generation of Google’s TPU chips. He’ll join Nvidia along with president Sunny Madra and…
Nvidia is investing an additional $2 billion into CoreWeave to accelerate its capacity buildout.
Nvidia will also make the Vera CPU available as a standalone offering, with CoreWeave to deploy it first. “Many” design wins to come.
Bloomberg.com
Nvidia Invests $2 Billion More in CoreWeave, Offers New Chip
Nvidia Corp., the dominant maker of artificial intelligence chips, invested an additional $2 billion in the cloud computing firm and key customer CoreWeave Inc., marking the latest example of the circular financing deals that have lifted valuations of AI…
Nvidia introduced 3 new open source models in the NV Earth-2 family, enabling weather forecasting with tools for data assimilation, forecasting, nowcasting, and downscaling.
Developers can also build climate simulations using PhysicsNeMo and create inference pipelines with the open source Earth2Studio framework.
DeepSeek just released #DeepSeek-OCR 2
Now AI can "see" an image in the same logical order as a human!
Its new method, the DeepEncoder V2, teaches the AI to dynamically reorder the pieces of an image based on its meaning, instead of just scanning it rigidly from left to right. This mimics how humans follow the logical flow of a scene.
The result is a model that outperforms conventional vision-language models, especially on images with complex layouts like documents or diagrams, by enabling more intelligent, causally-informed visual understanding.
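Conceptually, the reordering step might look like the following speculative sketch; DeepEncoder V2's actual mechanism is not described in this post, so the scoring module and dimensions here are assumptions:

```python
# Illustrative "visual causal flow": score image patches for semantic priority,
# then emit them in that order instead of raster (left-to-right) order.
import torch
import torch.nn as nn

class PatchReorderer(nn.Module):
    def __init__(self, d: int = 768):
        super().__init__()
        self.priority = nn.Linear(d, 1)   # learned "read me next" score per patch

    def forward(self, patches: torch.Tensor) -> torch.Tensor:
        # patches: (batch, n_patches, d), initially in raster order
        scores = self.priority(patches).squeeze(-1)        # (batch, n_patches)
        order = scores.argsort(dim=-1, descending=True)    # semantic reading order
        idx = order.unsqueeze(-1).expand_as(patches)
        return patches.gather(1, idx)                      # reordered patch sequence

reorder = PatchReorderer()
out = reorder(torch.randn(2, 196, 768))   # e.g., a 14x14 grid of ViT patches
```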
GitHub
GitHub - deepseek-ai/DeepSeek-OCR-2: Visual Causal Flow
Visual Causal Flow. Contribute to deepseek-ai/DeepSeek-OCR-2 development by creating an account on GitHub.
The “One Person Company” (OPC) model is booming, especially in innovation hubs like Shenzhen, where AI-powered entrepreneurship is reshaping the business landscape.
These OPCs, often led by a single founder supported by AI and minimal staff, offer fast decision-making, low costs, and high flexibility. Shenzhen is building dedicated OPC hubs, attracting creators nationwide.
Moonshot AI released Kimi K2.5, Open-Source Visual Agentic Intelligence
Global SOTA on Agentic Benchmarks: HLE full set (50.2%), BrowseComp (74.9%)
Open-source SOTA on Vision and Coding: MMMU Pro (78.5%), VideoMMMU (86.6%), SWE-bench Verified (76.8%)
Code with Taste: turn chats, images & videos into aesthetic websites with expressive motion.
Agent Swarm (Beta): self-directed agents working in parallel, at scale. Up to 100 sub-agents, 1,500 tool calls, 4.5× faster than a single-agent setup.
K2.5 is now live on kimi.com in chat mode and agent mode.
K2.5 Agent Swarm in beta for high-tier users.
For production-grade coding, you can pair K2.5 with Kimi Code
Weights & code.
Qwen released Qwen3-Max-Thinking, its flagship reasoning model, along with DeepPlanning.
Qwen says it demonstrates performance comparable to models such as GPT-5.2 Thinking and Opus 4.5.
Key innovations:
1. Adaptive tool-use: intelligently leverages Search, Memory & Code Interpreter without manual selection
2. Test-time scaling: multi-round self-reflection beats Gemini 3 Pro on reasoning
3. From complex math (98.0 on HMMT Feb) to agentic search (49.8 on HLE)—it just thinks better.
DeepPlanning is a new benchmark for long-horizon agent planning in real-world scenarios.
HF
ModelScope.
OpenAI introduced Prism, a free, AI-native workspace for scientists to write and collaborate on research, powered by GPT-5.2.
Accelerating science requires progress on two fronts:
1. Frontier AI models that use scientific tools and can tackle the hardest problems
2. Integrating that AI into the products scientists use every day
Prism is free to anyone with a ChatGPT account, with unlimited projects and collaborators.
OpenAI
Prism | A free, LaTeX-native workspace for scientists
Write, edit, and collaborate on scientific documents in LaTeX with Prism—a free workspace integrating GPT-5.2 into research and writing.