Salesforce introduced SFR-DeepResearch (SFR-DR): RL-trained autonomous agents that can reason, search, and code their way through deep research tasks.
SFR-DR agents are trained to operate independently, without pre-defined multi-agent workflows: they autonomously plan, reason, and take actions as defined by their tools.
SFR-DR-20B achieves 28.7% on Humanity's Last Exam (text-only) using only web search, browsing, and a Python interpreter, surpassing DeepResearch built on OpenAI o3 as well as Kimi Researcher.
SFR-DR agents are also trained to manage their own memory, summarizing previous results when context runs low. This yields a virtually unlimited effective context window and enables long-horizon tasks.
arXiv.org
SFR-DeepResearch: Towards Effective Reinforcement Learning for...
Equipping large language models (LLMs) with complex, interleaved reasoning and tool-use capabilities has become a key focus in agentic AI research, especially with recent advances in...
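The self-managed memory loop described above is easy to picture in code. Below is a minimal sketch of the pattern, not SFR-DR's actual implementation: call_llm and run_tool are hypothetical placeholders, and the token accounting is a crude character-count proxy.

```python
# Minimal sketch of an agent loop with self-managed memory: when the
# transcript nears the context limit, the agent replaces older tool results
# with its own summary. `call_llm` and `run_tool` are hypothetical
# placeholders, not SFR-DR's actual interfaces.

CONTEXT_LIMIT = 8000  # tokens; illustrative only

def token_count(messages):
    # Crude proxy: ~4 characters per token.
    return sum(len(m["content"]) for m in messages) // 4

def compact(messages, call_llm):
    """Summarize everything except the task prompt and the latest exchange."""
    head, tail = messages[:1], messages[-2:]
    summary = call_llm(
        [{"role": "user",
          "content": "Summarize the key findings so far:\n"
                     + "\n".join(m["content"] for m in messages[1:-2])}]
    )
    return head + [{"role": "assistant", "content": f"[memory] {summary}"}] + tail

def agent_loop(task, call_llm, run_tool, max_steps=50):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        if token_count(messages) > CONTEXT_LIMIT:
            messages = compact(messages, call_llm)
        action = call_llm(messages)          # model proposes the next tool call
        if action.startswith("FINAL:"):
            return action.removeprefix("FINAL:").strip()
        observation = run_tool(action)       # search / browse / python
        messages += [{"role": "assistant", "content": action},
                     {"role": "user", "content": observation}]
    return None
```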
A new research paper from Thinking Machines (ex-OpenAI team): Why LLMs Give Different Answers to the Same Question (And How to Fix It)
Ever notice that ChatGPT gives you slightly different responses when you ask the same question multiple times? Even at temperature 0, where the model should theoretically always pick the most likely token?
Most people assume this happens because of sampling randomness or GPU parallelization quirks. The conventional wisdom goes something like this: "GPUs do parallel calculations, floating-point math isn't associative, so results vary depending on which threads finish first."
This explanation isn't wrong, but it misses the real culprit. Horace He and the team at Thinking Machines dug deeper and found something more fundamental: batch invariance.
Here's what's actually happening: when you send a request to an LLM API, your output depends not just on your input, but on how many other people are using the service at the same time.
The server batches requests together for efficiency, and the batch size affects the numerical computations.
Even though each individual operation might be deterministic, the same input can produce different outputs depending on whether it's processed alone or with 10, 100, or 1000 other requests.
Think of it this way: you ask a question, but the answer changes based on how crowded the "room" is when you ask it.
This work challenges a common attitude in ML: "our systems are already probabilistic, so what's a little more randomness?" The researchers argue this is defeatist. With careful engineering, we can understand and eliminate these sources of nondeterminism.
They've open-sourced their implementation on top of vLLM, making it possible for others to achieve truly deterministic LLM inference today.
Thinking Machines Lab
Defeating Nondeterminism in LLM Inference
Reproducibility is a bedrock of scientific progress. However, it’s remarkably difficult to get reproducible results out of large language models.
For example, you might observe that asking ChatGPT the same question multiple times provides different results.…
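The floating-point ingredient behind this is easy to demonstrate yourself. The numpy sketch below (illustrative, not Thinking Machines' code) shows that changing only the order of a reduction changes the result, which is exactly what batch-size-dependent kernel tilings do:

```python
import numpy as np

# Floating-point addition is not associative, so the ORDER of a reduction
# changes the result in the low bits. Kernels that are not batch-invariant
# pick different reduction orders (different tilings / split sizes) at
# different batch sizes -- which is how the same request can yield different
# logits depending on how crowded the batch is.
print((0.1 + 1e20) - 1e20)   # 0.0
print(0.1 + (1e20 - 1e20))   # 0.1

rng = np.random.default_rng(0)
x = rng.standard_normal(2**16).astype(np.float32)

seq = np.float32(0.0)
for v in x:                  # one long sequential summation chain
    seq = seq + v

tiled = x.reshape(256, 256).sum(axis=1).sum()   # two-level (tiled) reduction

print(seq, tiled, seq == tiled)   # results typically differ in the low bits
```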
Chinese researchers introduced WebExplorer, a simple yet effective approach to training long-horizon web agents.
Instead of depending heavily on rigid pre-defined graph structures, WebExplorer uses a model-based exploration strategy to synthesize high-quality agentic data.
Its 8B model outperforms most 32B and even 72B models on BrowseComp and HLE.
arXiv.org
WebExplorer: Explore and Evolve for Training Long-Horizon Web Agents
The paradigm of Large Language Models (LLMs) has increasingly shifted toward agentic applications, where web browsing capabilities are fundamental for retrieving information from diverse online...
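For a rough picture of what "model-based exploration" means here, the sketch below shows one plausible explore-then-evolve synthesis loop under stated assumptions: call_llm and browse are hypothetical placeholders, and the exact prompts and evolution criteria in WebExplorer differ.

```python
# Hedged sketch of an explore-then-evolve data-synthesis loop: a model
# browses outward from a seed page to gather linked evidence, drafts a
# question-answer pair, then iteratively rewrites the question to be harder
# to shortcut. `call_llm` and `browse` are placeholders, not WebExplorer's
# actual interfaces.

def synthesize_example(seed_url, call_llm, browse, hops=3, evolve_rounds=2):
    # Phase 1: model-based exploration, no pre-built knowledge graph.
    evidence, url = [], seed_url
    for _ in range(hops):
        page = browse(url)                       # returns {"text": ..., "links": ...}
        evidence.append(page["text"])
        url = call_llm(f"Pick the most promising link to follow next:\n{page['links']}")

    qa = call_llm("Write a question whose answer needs ALL of these snippets, "
                  "plus the short answer:\n" + "\n---\n".join(evidence))

    # Phase 2: iterative evolution toward longer-horizon difficulty.
    for _ in range(evolve_rounds):
        qa = call_llm("Rewrite this QA pair so the question gives fewer direct "
                      f"clues but keeps the same verifiable answer:\n{qa}")
    return qa
```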
Nvidia released La-Proteina fully open source.
La-Proteina is a generative model demonstrating accurate co-design of fully atomistic protein structures (sequence + side chains + backbone) at scale, up to 800 residues, with state-of-the-art atomistic motif-scaffolding performance. Its code has just been open-sourced.
Paper.
Code.
Nvidia
La-Proteina: Atomistic Protein Generation via Partially Latent Flow Matching
La-Proteina is a novel partially-latent fully atomistic protein design model. Protein backbone structure is modeled explicitly, while sequence and atomistic details are captured via per-residue latent variables. La-Proteina achieves state-of-the-art performance…
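For readers unfamiliar with the base technique, here is a generic flow-matching training step in PyTorch. This is only the vanilla method that La-Proteina builds on; its partially latent variant, with per-residue latent variables, is not reproduced here.

```python
import torch

# Generic flow-matching training step -- the base technique named in the
# post, not La-Proteina's actual code. The network learns the velocity of a
# straight-line path from noise x0 to data x1.
def flow_matching_step(model, x1, optimizer):
    x0 = torch.randn_like(x1)                 # noise sample
    t = torch.rand(x1.shape[0], 1)            # uniform time in [0, 1]
    xt = (1 - t) * x0 + t * x1                # point on the interpolation path
    target_v = x1 - x0                        # constant velocity of that path
    pred_v = model(xt, t)                     # network predicts the velocity
    loss = ((pred_v - target_v) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```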
Medra AI has automated experimentation down to the physical level with reasoning and robotics.
The Medra technology platform consists of two core components:
1. Physical AI: Their general-purpose robots use vision-language models (VLMs) to operate standard laboratory instruments flexibly and execute experimental protocols. Medra is the first company to deploy Physical AI in the laboratory, leveraging the same advanced models that power self-driving cars and humanoid robots.
2. Scientific AI: Their reasoning models analyze experimental results and integrate with partners' internal infrastructure—such as LIMS, electronic lab notebooks, and ML pipelines—to glean insights from disparate data sources.
These two systems operate in a closed loop: Physical AI executes experiments while Scientific AI analyzes the outcomes and iterates on the design. This cycle helps scientists rapidly converge on the optimal protocol.
www.medra.ai
Physical AI in the Lab: Unlocking Data for Scientific Breakthroughs
Medra is building new AI technology to empower scientists in the lab.
ByteDance launched Seedream 4.0, an image generation tool that aims to compete with Google's “Nano Banana” AI image editor.
⚡️ Claude now has memory. Anthropic also introduced incognito chats for all users.
With project-scoped memory, each project maintains its own focused context.
Memory is fully optional with granular controls.
In settings, view the complete memory summary, edit what's stored, and guide Claude by telling it what to focus on or ignore.
Claude
Bringing memory to teams | Claude
Today, we’re introducing memory to the Claude app, where Claude remembers you and your team’s projects and preferences, eliminating the need to re-explain context and keeping complex work moving forward.
Anthropic shared best practices for developers on writing effective tools for LLM agents.
Anthropic
Writing effective tools for AI agents—using AI agents
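In the spirit of that guide, tool descriptions act as prompts, so clarity and constrained schemas matter. Below is an illustrative tool definition in the Messages API's documented tool-use format; the weather tool itself is a made-up example, not taken from Anthropic's post.

```python
# Illustrative tool definition in the Messages API tool-use format.
# The tool (weather lookup) is hypothetical; the point is the explicit
# description and constrained schema this style of guide recommends.
get_weather_tool = {
    "name": "get_weather",
    "description": (
        "Get the current weather for a city. Use this only when the user "
        "asks about present conditions; it cannot forecast. Returns the "
        "temperature in the requested unit plus a one-word condition."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "city": {"type": "string",
                     "description": "City name, e.g. 'Paris'"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"],
                     "description": "Temperature unit; defaults to celsius"},
        },
        "required": ["city"],
    },
}
```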
Meet Gauss, the first autoformalization agent. It just completed Terry Tao and Alex Kontorovich's Strong Prime Number Theorem project in 3 weeks, an effort on which human experts had spent 18+ months making only partial progress.
GitHub.
Early access.
GitHub
GitHub - math-inc/strongpnt
Contribute to math-inc/strongpnt development by creating an account on GitHub.
Google presented Speculative Cascades, a new approach for improving LLM efficiency that combines the best features of both cascades (where a small LLM handles queries before a larger LLM) and speculative decoding (where a drafter model's tokens are verified by a target model).
research.google
Speculative cascades — A hybrid approach for smarter, faster LLM inference
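A toy sketch of the hybrid idea follows: the small model drafts several tokens, and a cascade-style deferral rule decides per token whether to keep the draft or fall back to the large model. small_dist and large_dist are placeholder callables, and the paper's actual deferral rules are more refined than this threshold test.

```python
import numpy as np

def speculative_cascade_step(prefix, small_dist, large_dist, k=4, tau=0.3):
    """One drafting round. small_dist/large_dist map a token list to a
    next-token probability vector (placeholders, not a real API)."""
    ctx = list(prefix)
    for _ in range(k):                         # cheap autoregressive drafting
        ctx.append(int(np.argmax(small_dist(ctx))))
    drafts = ctx[len(prefix):]

    accepted = list(prefix)
    for tok in drafts:
        # In practice the target model scores all draft positions in one
        # parallel pass; sequential calls keep this sketch simple.
        q = large_dist(accepted)
        if q[tok] >= tau * q.max():            # cascade-style deferral rule:
            accepted.append(tok)               # draft is "good enough", keep it
        else:
            accepted.append(int(np.argmax(q))) # otherwise defer to the target
            break                              # and discard the rest of the draft
    return accepted
```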
Google shared a new work:
Virtual Agent Economies
Researchers discussed a number of possible frameworks for establishing steerable agent markets.
The rapid adoption of AI agents points to a future where AI agents may be able to produce economic value independently of human labor.
Coupled with the development of new interoperability standards like the Agent2Agent (A2A) and Model Context Protocol (MCP), this signals the inevitable emergence of a new economic layer.
An emerging virtual (sandbox) AI agent economy may offer opportunities for insulation and safeguarding, for establishing potentially unprecedented coordination between agents, and for orchestrating their interactions toward major societal or community goals or toward better alignment with user preferences.
Market-based mechanisms like auctions may also be employed for fair resource allocation.
Finally, the researchers outline the technical and governance infrastructure, such as verifiable credentials for establishing trust, required to scale agentic AI deployments safely and robustly. These safeguards are necessary to address systemic market risks and prevent exacerbating inequalities.
arXiv.org
Virtual Agent Economies
The rapid adoption of autonomous AI agents is giving rise to a new economic layer where agents transact and coordinate at scales and speeds beyond direct human oversight. We propose the "sandbox...
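The auction mechanisms mentioned above can be made concrete with a standard example. The snippet below implements a sealed-bid second-price (Vickrey) auction, where truthful bidding is a dominant strategy; it is a generic illustration, not a mechanism proposed in the paper.

```python
# Sealed-bid second-price (Vickrey) auction: the highest bidder wins but
# pays the second-highest bid, which removes any incentive to shade bids --
# one standard tool for fair resource allocation among agents.

def second_price_auction(bids):
    """bids: dict of agent -> bid. Returns (winner, price paid)."""
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner, _ = ranked[0]
    price = ranked[1][1] if len(ranked) > 1 else 0.0   # second-highest bid
    return winner, price

print(second_price_auction({"agent_a": 5.0, "agent_b": 8.0, "agent_c": 6.5}))
# ('agent_b', 6.5): agent_b wins but pays agent_c's bid.
```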
OpenAI introduced OpenAI Grove: a program for early-stage founders.
Grove builds on OpenAI's work with OpenAI for Startups and Pioneers, and includes 5 weeks of hands-on workshops, office hours, events with the OpenAI team, and early access.
OpenAI
Apply to OpenAI Grove
A program for individuals early in their company building journey.
UAE released K2-Think, an open-source AI reasoning model.
32 billion parameters. That's it. And this thing matches GPT-4-level reasoning while being 20x smaller, matching or beating models many times its size.
It is built on Qwen2.5 32B and trained with long chain-of-thought examples, so it learns to show its reasoning step by step.
Then reinforcement learning is added, using tasks where answers can be checked automatically, like math or code, so the model improves by being rewarded for correct results.
At test time, two tricks are used. First, a helper model writes a short plan before solving, which gives structure. Second, the system generates 3 answers and another model picks the best, which improves accuracy and keeps responses shorter.
Speed is handled with specialized hardware, the Cerebras Wafer Scale Engine, which delivers about 2,000 tokens per second. This makes even very long reasoning tasks run in seconds instead of minutes.
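The two test-time tricks are simple to express as orchestration code. Below is a hedged sketch: planner, solver, and selector are placeholder callables, and the prompts are illustrative, not K2-Think's actual ones.

```python
# Sketch of the two test-time tricks described above: a planner drafts a
# short plan before solving, then three candidate solutions are generated
# and a selector model picks the best. All callables are placeholders, not
# K2-Think's actual interfaces.

def solve(question, planner, solver, selector, n=3):
    plan = planner(f"Write a short step-by-step plan (no solution):\n{question}")
    candidates = [
        solver(f"{question}\n\nPlan:\n{plan}\n\nSolve, showing your reasoning.")
        for _ in range(n)                      # best-of-n sampling
    ]
    listing = "\n\n".join(f"[{i}] {c}" for i, c in enumerate(candidates))
    best = selector(
        f"Question:\n{question}\n\nCandidates:\n{listing}\n\n"
        "Reply with only the index of the most correct, well-reasoned answer."
    )
    return candidates[int(best.strip())]
```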
OpenAI released GPT-5-Codex — a version of GPT-5 further optimized for agentic coding in Codex.
Available in the Codex CLI, IDE extension, web, mobile, and for code reviews in GitHub.
OpenAI
Introducing upgrades to Codex
Codex just got faster, more reliable, and better at real-time collaboration and tackling tasks independently anywhere you develop—whether via the terminal, IDE, web, or even your phone.
ByteDance introduced EMPG, a framework that recalibrates the learning signal using the agent's own uncertainty.
Compared with GRPO and DAPO, it achieves promising gains on agent benchmarks like WebShop, ALFWorld, and Deep Search.
Paper.
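The post doesn't expand the acronym, but the described idea, recalibrating the learning signal by the agent's own uncertainty, can be sketched as entropy-based advantage reweighting. This is only the shape of the idea; EMPG's actual formulation differs.

```python
import numpy as np

# Toy sketch: rescale each step's learning signal by the policy's own
# certainty (derived from its entropy), so confident steps drive stronger
# updates and uncertain ones are attenuated. Illustrative only, not EMPG's
# exact recalibration.

def entropy(p):
    p = np.asarray(p)
    return -(p * np.log(p + 1e-12)).sum()

def modulated_advantages(advantages, step_dists, max_entropy):
    out = []
    for adv, dist in zip(advantages, step_dists):
        confidence = 1.0 - entropy(dist) / max_entropy   # 1 = certain, 0 = uniform
        out.append(adv * confidence)                      # recalibrated signal
    return out

# Example: two steps with equal raw advantage; the confident step dominates.
dists = [[0.9, 0.05, 0.05], [0.34, 0.33, 0.33]]
print(modulated_advantages([1.0, 1.0], dists, max_entropy=np.log(3)))
```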
Google launched new protocol for agent-driven purchases
Google announced a new open protocol for purchases initiated by AI agents: automated software programs that can shop and make decisions on behalf of users. The AI payments protocol supports credit cards and stablecoins and was built with Coinbase, the Ethereum Foundation, and over 60 other partners, per Fortune. GitHub.
Called the Agent Payments Protocol (AP2), the system is meant to be interoperable between AI platforms, payment systems and vendors, providing a traceable paper trail for each transaction.
In collaboration with the cryptocurrency outfits Coinbase and MetaMask and with the Ethereum Foundation, Google also produced an extension that integrates the cryptocurrency-oriented x402 protocol, allowing AI-driven purchasing from crypto wallets.
A number of other tech companies are working on their own agentic purchasing systems, most notably Perplexity, which offers a Buy with Pro service in its agentic browser. The payment provider Stripe also produces software tools for agentic purchasing on its platform, though they are not as comprehensive as AP2.
TechCrunch
Google launches new protocol for agent-driven purchases | TechCrunch
Called the Agent Payments Protocol (AP2), the system is meant to be interoperable between AI platforms, payment systems, and vendors.
That's a lot of money for robots: Figure has exceeded $1B in funding at a $39B post-money valuation
The round was led by Parkway Venture Capital with significant investments from Brookfield Asset Management, NVIDIA, Macquarie Capital, Intel Capital, Align Ventures, Tamarack Global, LG Technology Ventures, Salesforce, T-Mobile Ventures, and Qualcomm Ventures.
The new funding will support Figure's momentum across three core areas:
1. Scaling humanoid robots into homes & commercial operations
2. Building next-generation GPU infrastructure to accelerate training & simulation
3. Launching advanced data collection efforts for Helix
FigureAI
Figure Exceeds $1B in Series C Funding at $39B Post-Money Valuation
Tongyi Lab dropped half a dozen new papers, most focused on Deep Research agents.
1. Tongyi DeepResearch: Open-source DeepResearch Agent
• First OSS web agent matching OpenAI’s DeepResearch
• SOTA on HLE (32.9), BrowseComp (43.4/46.7), xbench-DeepSearch (75)
• Full-stack pipeline: Agentic CPT → SFT → RL w/ synthetic data
• Native ReAct & new Heavy Mode (IterResearch) for long-horizon tasks
2. WebResearcher: Unbounded reasoning for long-horizon agents
• IterResearch: Iterative deep-research paradigm (avoids context suffocation & noise)
• WebFrontier: Tool-augmented data engine for complex research tasks
• Parallel agents + synthesis → scalable, evidence-grounded reasoning
• Beats proprietary systems: 36.7% on HLE, 51.7% on BrowseComp
3. AgentScaler: Towards General Agentic Intelligence
• Scales environments for diverse, realistic tool-calling
• Fully simulated envs = verifiable + scalable interactions
• SOTA on τ-bench, τ²-bench, ACEBench
• AgentScaler-30B matches 1T-parameter models with far fewer params
4. AgentFounder: Scaling Agents via Continual Pre-training
• First to propose Agentic CPT → builds agentic foundation models before fine-tuning
• Solves post-training bottlenecks (capabilities + alignment conflict)
• Data synthesis: First-order (planning/actions) + Higher-order (multi-step decision)
• Two-stage training (32K → 128K context)
• SOTA: 39.9% BrowseComp-en, 72.8% GAIA
5. WebWeaver: Structuring Web-Scale Evidence for Deep Research
• Dual-agent framework (Planner + Writer)
• Dynamic outlines: search ↔ refine ↔ search (human-like loop)
• Memory-grounded, section-by-section synthesis → avoids long-context failures
• SOTA across DeepResearch Bench, DeepConsult, DeepResearchGym
• Produces reliable, well-cited, structured reports
6. ReSum: Long-Horizon Web Agents Without Context Limits
• Problem: ReAct hits context limits in long searches (32k tokens)
• Solution: ReSum periodically compresses history → compact reasoning states
• ReSumTool-30B: specialized summarizer extracts key evidence & gaps
• ReSum-GRPO (RL): trains agents to adapt summaries into reasoning
• +4.5% over ReAct baseline, +8.2% with RL across web search benchmarks (see the sketch after this list).
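As promised, a minimal sketch of the ReSum pattern from item 6: when the history nears its budget, the agent replaces it with a compact evidence-and-gaps summary and continues from that state. agent_step and summarize are placeholders (the latter standing in for a model like ReSumTool-30B), not the paper's code.

```python
# Hedged sketch of periodic history compression: the accumulated ReAct-style
# transcript is replaced by a compact summary of evidence found and
# information still missing, then reasoning resumes from that state.

def resum_loop(task, agent_step, summarize, budget_tokens=32_000, max_steps=100):
    history = [f"TASK: {task}"]
    for _ in range(max_steps):
        if sum(len(h) for h in history) // 4 > budget_tokens:  # rough token count
            state = summarize(
                "Compress into: (1) evidence found so far, "
                "(2) information still missing:\n" + "\n".join(history)
            )
            history = [f"TASK: {task}", f"STATE: {state}"]     # restart from summary
        thought, action, observation = agent_step(history)     # ReAct-style step
        if action == "ANSWER":
            return observation
        history += [thought, f"ACT: {action}", f"OBS: {observation}"]
    return None
```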
Anthropic shipped two updates for developers using Claude
1. Claude in Xcode 26: Claude Sonnet 4 is now available as a coding assistant directly in Apple's IDE. Developers can connect their Claude account to access natural language code interaction, documentation generation, and inline editing tools. The integration shares usage limits with other Claude platforms and works with Pro, Max, and premium Team/Enterprise plans.
2. Claude Code UX update: A small but useful interface improvement. Keywords like "think" and "ultrathink" now get highlighted when they would trigger extended thinking mode. Use /t to disable the mode, preventing accidental activation when these words appear in regular prompts.
Anthropic
Claude is now generally available in Xcode
Connect your Claude account to Xcode 26 for AI-powered coding assistance. Debug, refactor, and build Apple apps faster with Claude Sonnet 4 by Anthropic.