Google, UC Berkeley and an international team of researchers present Aletheia, a math research agent built on Gemini
The system uses AI to systematically scan hundreds of complex conjectures, filtering through potential proofs with natural language verification before sending the best candidates to human experts for final review.
The team resolved 13 "open" problems from the ErdΕs database, generating 4 brand-new solutions and identifying 9 others that were actually solved in obscure corners of existing literature.
The system uses AI to systematically scan hundreds of complex conjectures, filtering through potential proofs with natural language verification before sending the best candidates to human experts for final review.
The team resolved 13 "open" problems from the ErdΕs database, generating 4 brand-new solutions and identifying 9 others that were actually solved in obscure corners of existing literature.
β€2π₯2π2
Bytedance dropped advanced video generation model
Seedance 2.0 has:
β native audio gen (lipsynced speech + music)
β drastic step up from Veo 3.1 / Sora 2 in quality
β supports multimodal input
β 2k resolution
Goes beyond cinematic video, and can do product demos as well. And it's really hard to tell it's AI.
Seedance 2.0 has:
β native audio gen (lipsynced speech + music)
β drastic step up from Veo 3.1 / Sora 2 in quality
β supports multimodal input
β 2k resolution
Goes beyond cinematic video, and can do product demos as well. And it's really hard to tell it's AI.
WaveSpeedAI
Seedance 2.0 Complete Guide: Multimodal Video Creation - WaveSpeed Blog
Seedance 2.0 is now live on WaveSpeedAI. Master its multimodal video generation with this comprehensive guide β combine images, videos, audio, and text for precise control over motion, style, and storytelling.
π₯3π3β€2
The PaddleOCR Document Parsing Skill is now live on ClawHub, ready to plug directly into OpenClaw workflows.
Instead of deploying OCR services or wiring APIs, developers can now invoke PaddleOCR as a standardized composable Skill node β embedding document understanding directly into Agents and automation pipelines.
Built on PaddleOCR-VL-1.5, the Skill delivers
1. Multi-format parsing (PDF, JPG, PNG, BMP, TIFF)
2. Layout analysis β text, tables, formulas, headers
3. 110+ language coverage
4. Structured Markdown output preserving hierarchy
No deployment. No wrappers. Just configuration β and build your document intelligence chain inside OpenClaw.
Instead of deploying OCR services or wiring APIs, developers can now invoke PaddleOCR as a standardized composable Skill node β embedding document understanding directly into Agents and automation pipelines.
Built on PaddleOCR-VL-1.5, the Skill delivers
1. Multi-format parsing (PDF, JPG, PNG, BMP, TIFF)
2. Layout analysis β text, tables, formulas, headers
3. 110+ language coverage
4. Structured Markdown output preserving hierarchy
No deployment. No wrappers. Just configuration β and build your document intelligence chain inside OpenClaw.
GitHub
GitHub - PaddlePaddle/PaddleOCR: Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkitβ¦
Turn any PDF or image document into structured data for your AI. A powerful, lightweight OCR toolkit that bridges the gap between images/PDFs and LLMs. Supports 100+ languages. - PaddlePaddle/Paddl...
π₯4β€3π3π€1
What if your model could learn from its own drafts during RL training?
NVIDIA introduced iGRPO: Iterative Group Relative Policy Optimization.
Researchers add a self-feedback loop to GRPO: the model drafts multiple solutions, picks its best one, then learns to refine beyond it.
Core idea:
Stage 1 β explore and select your strongest attempt. Stage 2 β condition on that attempt and beat it.
Same scalar reward. No critics, no generated critiques, no verification text. The best draft is the only feedback the model needs.
Results across 7B / 8B / 14B models:
β’ Nemotron-H-8B-Base-8K: 41.1% β 45.0% (+3.96 over GRPO)
β’ DeepSeek-R1-Distill-Qwen-7B: 68.3% β 69.9%
β’ OpenMath-Nemotron-14B: 76.7% β 78.0%
β’ OpenReasoning-Nemotron-7B on AceReason-Math: 85.62% AIME24 / 79.64% AIME25
The same two-stage wrapper also improves DAPO and GSPO. It's not tied to GRPO at all.
NVIDIA introduced iGRPO: Iterative Group Relative Policy Optimization.
Researchers add a self-feedback loop to GRPO: the model drafts multiple solutions, picks its best one, then learns to refine beyond it.
Core idea:
Stage 1 β explore and select your strongest attempt. Stage 2 β condition on that attempt and beat it.
Same scalar reward. No critics, no generated critiques, no verification text. The best draft is the only feedback the model needs.
Results across 7B / 8B / 14B models:
β’ Nemotron-H-8B-Base-8K: 41.1% β 45.0% (+3.96 over GRPO)
β’ DeepSeek-R1-Distill-Qwen-7B: 68.3% β 69.9%
β’ OpenMath-Nemotron-14B: 76.7% β 78.0%
β’ OpenReasoning-Nemotron-7B on AceReason-Math: 85.62% AIME24 / 79.64% AIME25
The same two-stage wrapper also improves DAPO and GSPO. It's not tied to GRPO at all.
arXiv.org
iGRPO: Self-Feedback-Driven LLM Reasoning
Large Language Models (LLMs) have shown promise in solving complex mathematical problems, yet they still fall short of producing accurate and consistent solutions. Reinforcement Learning (RL) is a...
β€4π₯3π3
Google introduced DialogLab a new open-source prototyping framework, uses a human-in-the-loop control strategy to achieve realistic human-AI group simulation, offering a necessary alternative to fully autonomous agents.
Evaluations with domain experts found that its "Human Control" mode (where you can edit, accept, or dismiss real-time AI suggestions) was preferred in realism, effectiveness, and engagement.
DialogLab transforms dialogue design from rigid scripts to spontaneous, adaptable group dynamics.
Evaluations with domain experts found that its "Human Control" mode (where you can edit, accept, or dismiss real-time AI suggestions) was preferred in realism, effectiveness, and engagement.
DialogLab transforms dialogue design from rigid scripts to spontaneous, adaptable group dynamics.
Google Research
Beyond one-on-one: Authoring, simulating, and testing dynamic human-AI group conversations
DialogLab is a research prototype that provides a unified interface to configure conversational scenes, define agent personas, manage group structures, specify turn-taking rules, and orchestrate transitions between scripted narratives and improvisation.
β€2π₯2π2
This new research introduces Agyn, an open-source multi-agent platform that models software engineering as a team-based organizational process rather than a monolithic task.
The system configures a team of four specialized agents: a manager, researcher, engineer, and reviewer. Each operates within its own isolated sandbox with role-specific tools, prompts, and language model configurations. The manager agent coordinates dynamically based on intermediate outcomes rather than following a fixed pipeline.
What makes the design interesting?
Different agents use different models depending on their role. The manager and researcher run on GPT-5 for stronger reasoning and broader context. The engineer and reviewer use GPT-5-Codex, a smaller code-specialized model optimized for iterative implementation and debugging. This mirrors how real teams allocate resources based on task requirements.
The workflow follows a GitHub-native process. Agents analyze issues, create pull requests, conduct inline code reviews, and iterate through revision cycles until the reviewer explicitly approves. No human intervention at any point. The number of steps isn't predetermined. It emerges from task complexity.
The system configures a team of four specialized agents: a manager, researcher, engineer, and reviewer. Each operates within its own isolated sandbox with role-specific tools, prompts, and language model configurations. The manager agent coordinates dynamically based on intermediate outcomes rather than following a fixed pipeline.
What makes the design interesting?
Different agents use different models depending on their role. The manager and researcher run on GPT-5 for stronger reasoning and broader context. The engineer and reviewer use GPT-5-Codex, a smaller code-specialized model optimized for iterative implementation and debugging. This mirrors how real teams allocate resources based on task requirements.
The workflow follows a GitHub-native process. Agents analyze issues, create pull requests, conduct inline code reviews, and iterate through revision cycles until the reviewer explicitly approves. No human intervention at any point. The number of steps isn't predetermined. It emerges from task complexity.
arXiv.org
Agyn: A Multi-Agent System for Team-Based Autonomous Software Engineering
Large language models have demonstrated strong capabilities in individual software engineering tasks, yet most autonomous systems still treat issue resolution as a monolithic or pipeline-based...
π₯3β€2π2
Stripe launched (a preview) of machine payments a way for developers to directly charge agents, with a few lines of code.
Stripe launched with support for x402 using USDC stablecoins on base, with more protocols, payment methods, currencies, and chains to come.
And sales tax, refunds, and reporting just work. (You only need to think about crypto if you want to!)
Also released an open source cli called `purl` for you (and your bots) to test machine payments in the terminal, along with Node and Python samples. Yes, payments + curl creatively smushed together.
Stripe launched with support for x402 using USDC stablecoins on base, with more protocols, payment methods, currencies, and chains to come.
And sales tax, refunds, and reporting just work. (You only need to think about crypto if you want to!)
Also released an open source cli called `purl` for you (and your bots) to test machine payments in the terminal, along with Node and Python samples. Yes, payments + curl creatively smushed together.
Stripe
Machine payments
Machine payments allows automated systems and AI agents to make payments on behalf of users.
β€3π₯3π2
Google is adding a way for consumers to buy things while seeking AI powered answers on search and in its Gemini chatbot β part of a plan to make money more directly from consumersβ AI use.
Bloomberg.com
Google Pushes AI Shopping Features in Search and Gemini Chatbot
Google is adding a way for consumers to buy things while seeking artificial intelligence-powered answers on search and in its Gemini chatbot β part of a plan to make money more directly from consumersβ AI use.
β€2π2π₯2
OpenAI announced new primitives for building agents.
Openai
Shell + Skills + Compaction: Tips for long-running agents that do real work | OpenAI Developers
Practical patterns for building with skills, hosted shell, and server-side compaction in the Responses API.
β€3π₯3π―3
Zhipu released GLM-5
The model is open source. It matches Claude Opus 4.5 on coding benchmarks. Beats Gemini 3 Pro on some tests. But the interesting part isn't the benchmarks.
GLM-5 is built for agents. The company designed it for long-running tasks and tool invocation. In the ΟΒ²-Bench interactive tool evaluation, it scored 84.7, beating Claude Sonnet 4.5.
Think about what that means. A model designed to work inside coding environments like Claude Code, Kilo Code, and Cline. "Think before you act" mechanisms baked into the architecture. Better planning for complex multi-step tasks.
Zhipu's traffic has jumped five-fold recently. The company had to implement subscription limits to handle demand. Most of that demand is coming from the US and China, followed by India, Japan, and Brazil.
The release pace is accelerating. GLM-4.6 came out in September. GLM-4.7 in January. GLM-5 in February. That's three major versions in six months.
DeepSeek proved that open models can spread fast when they're genuinely good. Zhipu is following the same playbook. Open weights, strong coding performance, agent optimization.
7 of the top 10 AI models on current leaderboards are now Chinese. The competition isn't just about who has the smartest model anymore. It's about who builds the best tools for developers.
The model is open source. It matches Claude Opus 4.5 on coding benchmarks. Beats Gemini 3 Pro on some tests. But the interesting part isn't the benchmarks.
GLM-5 is built for agents. The company designed it for long-running tasks and tool invocation. In the ΟΒ²-Bench interactive tool evaluation, it scored 84.7, beating Claude Sonnet 4.5.
Think about what that means. A model designed to work inside coding environments like Claude Code, Kilo Code, and Cline. "Think before you act" mechanisms baked into the architecture. Better planning for complex multi-step tasks.
Zhipu's traffic has jumped five-fold recently. The company had to implement subscription limits to handle demand. Most of that demand is coming from the US and China, followed by India, Japan, and Brazil.
The release pace is accelerating. GLM-4.6 came out in September. GLM-4.7 in January. GLM-5 in February. That's three major versions in six months.
DeepSeek proved that open models can spread fast when they're genuinely good. Zhipu is following the same playbook. Open weights, strong coding performance, agent optimization.
7 of the top 10 AI models on current leaderboards are now Chinese. The competition isn't just about who has the smartest model anymore. It's about who builds the best tools for developers.
π3π₯2π2π2
The agent economy just got a real marketplace
Moltlaunch is live on Base. Browse specialized AI agents, hire them for real work, and back the ones you believe in.
Every completed job burns tokens and leaves a review onchain through ERC-8004.
Moltlaunch is live on Base. Browse specialized AI agents, hire them for real work, and back the ones you believe in.
Every completed job burns tokens and leaves a review onchain through ERC-8004.
Moltlaunch
moltlaunch β hire AI agents, pay with ETH
The agent marketplace and open protocol for agent work. Trustless escrow, permanent reputation, tradeable tokens on Base.
π₯5β€2π2
Does being a math genius make an AI better at understanding human intentions?
Researchers from Arizona State University and Microsoft Research Asia investigated whether the step-by-step logic used for coding helps AI master Theory of Mindβthe ability to sense what others are thinking and feeling.
The results show that more thinking time can actually cause social reasoning to collapse, with advanced reasoning models often being outperformed by simpler ones. Unlike in math or code, these models frequently rely on answer-matching shortcuts rather than true deduction, proving that social intelligence requires a unique approach beyond existing reasoning methods.
Researchers from Arizona State University and Microsoft Research Asia investigated whether the step-by-step logic used for coding helps AI master Theory of Mindβthe ability to sense what others are thinking and feeling.
The results show that more thinking time can actually cause social reasoning to collapse, with advanced reasoning models often being outperformed by simpler ones. Unlike in math or code, these models frequently rely on answer-matching shortcuts rather than true deduction, proving that social intelligence requires a unique approach beyond existing reasoning methods.
arXiv.org
To Think or Not To Think, That is The Question for Large Reasoning...
Theory of Mind (ToM) assesses whether models can infer hidden mental states such as beliefs, desires, and intentions, which is essential for natural social interaction. Although recent progress in...
π₯4π₯°3π2
OpenClaw is cool, but too large?
Hong Kong released nanobot to solve this exact problem.
Researchers transformed the massive OpenClaw system into a clean 4,000-line Python framework that focuses on a simple loop: receive input, let the AI think, and execute tools like file management or web searches.
It strips away complex abstractions to focus on clear, modular function calls that any developer can understand.
By slashing code complexity by 99 percent, they achieved full functional parity with a 2-minute deployment time, making it significantly easier to customize and learn than traditional bloated agent architectures.
Hong Kong released nanobot to solve this exact problem.
Researchers transformed the massive OpenClaw system into a clean 4,000-line Python framework that focuses on a simple loop: receive input, let the AI think, and execute tools like file management or web searches.
It strips away complex abstractions to focus on clear, modular function calls that any developer can understand.
By slashing code complexity by 99 percent, they achieved full functional parity with a 2-minute deployment time, making it significantly easier to customize and learn than traditional bloated agent architectures.
GitHub
GitHub - HKUDS/nanobot: Lightweight, open-source AI agent for your tools, chats, and workflows.
Lightweight, open-source AI agent for your tools, chats, and workflows. - HKUDS/nanobot
π5π3π₯3β€2
Researchers from Huazhong University of Science and Technology and ByteDance Seed just introduced Stable-DiffCoder.
Instead of writing code one token at a time like standard models, this method uses a block diffusion approach to generate and refine code chunks simultaneously, resulting in more stable and structured programming.
The results show it outperforms its autoregressive counterparts and various 8B-parameter models on major benchmarks, specifically excelling in code editing, logical reasoning, and low-resource programming languages.
Code
Models.
Instead of writing code one token at a time like standard models, this method uses a block diffusion approach to generate and refine code chunks simultaneously, resulting in more stable and structured programming.
The results show it outperforms its autoregressive counterparts and various 8B-parameter models on major benchmarks, specifically excelling in code editing, logical reasoning, and low-resource programming languages.
Code
Models.
π3β€2π₯2π₯°2
Google shared new work on envisioning Intelligent AI Delegation
As they've discussed previously, the expansion of the agentic web opens up new opportunities for establishing virtual agentic economies and steerable markets.
Collective intelligence is likely to play an increasingly important role in the coming period, as complex tasks may get distributed across nodes, where each agent may be able to leverage their unique skills and differential access to tools, libraries, and data, to more efficiently and effectively handle sub-tasks that are distributed across the network.
Yet, delegation is more than just task decomposition into manageable sub-units of action. Beyond the creation of sub-tasks, delegation necessitates the assignment of responsibility and authority and thus implicates accountability for outcomes. Delegation thus involves risk assessment, which can be moderated by trust. Delegation further involves capability matching and continuous performance monitoring, incorporating dynamic adjustments based on feedback, and ensuring completion of the distributed task under the specified constraints.
There is a pressing need for Intelligent Delegation - a robust framework centered around clear roles, boundaries, reputation, trust, transparency, certifiable agentic capabilities, verifiable task execution, and scalable task distribution.
Googleβs framework thus proposed intelligent AI delegation that incorporates components for dynamic assessment, adaptive execution, structural transparency, scalable market coordination, and systemic resilience. Google proposed a framework that adapts the approach based on the criticality of the task at hand, its reversibility, resource requirements, complexity, projected duration, and other important properties.
Google introduced a notion of contract-first decomposition as a binding constraint, rendering task delegation is contingent upon the outcome having precise verification.
As they've discussed previously, the expansion of the agentic web opens up new opportunities for establishing virtual agentic economies and steerable markets.
Collective intelligence is likely to play an increasingly important role in the coming period, as complex tasks may get distributed across nodes, where each agent may be able to leverage their unique skills and differential access to tools, libraries, and data, to more efficiently and effectively handle sub-tasks that are distributed across the network.
Yet, delegation is more than just task decomposition into manageable sub-units of action. Beyond the creation of sub-tasks, delegation necessitates the assignment of responsibility and authority and thus implicates accountability for outcomes. Delegation thus involves risk assessment, which can be moderated by trust. Delegation further involves capability matching and continuous performance monitoring, incorporating dynamic adjustments based on feedback, and ensuring completion of the distributed task under the specified constraints.
There is a pressing need for Intelligent Delegation - a robust framework centered around clear roles, boundaries, reputation, trust, transparency, certifiable agentic capabilities, verifiable task execution, and scalable task distribution.
Googleβs framework thus proposed intelligent AI delegation that incorporates components for dynamic assessment, adaptive execution, structural transparency, scalable market coordination, and systemic resilience. Google proposed a framework that adapts the approach based on the criticality of the task at hand, its reversibility, resource requirements, complexity, projected duration, and other important properties.
Google introduced a notion of contract-first decomposition as a binding constraint, rendering task delegation is contingent upon the outcome having precise verification.
arXiv.org
Intelligent AI Delegation
AI agents are able to tackle increasingly complex tasks. To achieve more ambitious goals, AI agents need to be able to meaningfully decompose problems into manageable sub-components, and safely...
π₯4β€2π2
MiniMax Introduced M2.5
Trained with Rl across hundreds of thousands of complex real-world environments, it delivers SOTA performance in coding, agentic tool use, search, and office workflows.
At $1 per hour with 100 tps, infinite scaling of long-horizon agents now economically possible.
GitHub.
Trained with Rl across hundreds of thousands of complex real-world environments, it delivers SOTA performance in coding, agentic tool use, search, and office workflows.
At $1 per hour with 100 tps, infinite scaling of long-horizon agents now economically possible.
GitHub.
GitHub
GitHub - MiniMax-AI/MiniMax-M2.5
Contribute to MiniMax-AI/MiniMax-M2.5 development by creating an account on GitHub.
β€1π₯1π1
Moonshot AI Introduced Kimi Claw
OpenClaw, now native to kimi.com.
1. ClawHub Access: 5,000+ community skills in the ClawHub library.
2. 40GB Cloud Storage: Massive space for all your files
3. Pro-Grade Search: Fetch live, high-quality data directly from Yahoo Finance and more.
4. Bring Your Own Claw: Connect your third-party OpenClaw to kimi.com, chat with your setup, or bridge it to apps like Telegram groups.
OpenClaw, now native to kimi.com.
1. ClawHub Access: 5,000+ community skills in the ClawHub library.
2. 40GB Cloud Storage: Massive space for all your files
3. Pro-Grade Search: Fetch live, high-quality data directly from Yahoo Finance and more.
4. Bring Your Own Claw: Connect your third-party OpenClaw to kimi.com, chat with your setup, or bridge it to apps like Telegram groups.
Kimi
Kimi Claw | 24/7 AI Agent, Now with Claw Groups (Preview)
Deploy OpenClaw in minutes to build a 24/7 AI agent with memory and scheduled tasks. Experience Claw Groups (Preview) for multi-agent and human collaboration in shared groups.
π6π₯2π₯°2
Meet Qwen3.5-397B-A17B an open-weight vision-language model.
Built for the future of coding, reasoning, and seamless multimodal interaction.
Key Highlights:
Inference Efficiency: A massive 397B total parameters, but only 17B activeβdelivering flagship power at a fraction of the cost.
Hybrid Architecture: Innovative Gated Delta Networks (Linear Attention) + Sparse MoE for extreme speed.
True Multimodality: Exceptional performance across GUI interaction, video comprehension, and agentic workflows.
Global Scale: Qwen3.5 now supports over 200 languages.
Empowering developers and enterprises to build smarter, faster, and more versatile AI agents
Built for the future of coding, reasoning, and seamless multimodal interaction.
Key Highlights:
Inference Efficiency: A massive 397B total parameters, but only 17B activeβdelivering flagship power at a fraction of the cost.
Hybrid Architecture: Innovative Gated Delta Networks (Linear Attention) + Sparse MoE for extreme speed.
True Multimodality: Exceptional performance across GUI interaction, video comprehension, and agentic workflows.
Global Scale: Qwen3.5 now supports over 200 languages.
Empowering developers and enterprises to build smarter, faster, and more versatile AI agents
π2π₯1π₯°1π1
A Chinese hardware team introduced PicoClaw
They took a 430,000-line AI assistant that needs a $599 Mac Mini and 1GB of RAM β and rewrote it in Go so it runs on a $9.9 dev board with less than 10MB of memory.
Boot time: from 500 seconds to 1 second.
Cost: from $599 to $9.9.
Memory: from 1GB to 10MB.
Same features: code generation, web search, Discord/Telegram chat, memory system, scheduled tasks, security sandbox.
The wildest part? They claim 95% of the new codebase was written by AI agents themselves. The humans just guided the architecture. It's an AI assistant that literally rebuilt itself to be smaller.
Launched February 9th. Four days later: 7,400+ GitHub stars.
This is the pattern no one's talking about enough.
Every AI capability that starts expensive gets commoditized within months. GPT-4 level models went open source in 6 months. Now the hardware floor for running a personal AI agent just dropped 60x in weeks.
The infrastructure moat in AI isn't sustainable. The only defensible advantage is what you do with these tools β not access to them.
They took a 430,000-line AI assistant that needs a $599 Mac Mini and 1GB of RAM β and rewrote it in Go so it runs on a $9.9 dev board with less than 10MB of memory.
Boot time: from 500 seconds to 1 second.
Cost: from $599 to $9.9.
Memory: from 1GB to 10MB.
Same features: code generation, web search, Discord/Telegram chat, memory system, scheduled tasks, security sandbox.
The wildest part? They claim 95% of the new codebase was written by AI agents themselves. The humans just guided the architecture. It's an AI assistant that literally rebuilt itself to be smaller.
Launched February 9th. Four days later: 7,400+ GitHub stars.
This is the pattern no one's talking about enough.
Every AI capability that starts expensive gets commoditized within months. GPT-4 level models went open source in 6 months. Now the hardware floor for running a personal AI agent just dropped 60x in weeks.
The infrastructure moat in AI isn't sustainable. The only defensible advantage is what you do with these tools β not access to them.
β€12π3π₯2π₯°2
NVIDIA dropped PersonaPlex-7B
A full-duplex voice model that listens and talks at the same time.
No pauses. No turn-taking. Real conversation.
100% open source. Free.
Voice AI just leveled up.
A full-duplex voice model that listens and talks at the same time.
No pauses. No turn-taking. Real conversation.
100% open source. Free.
Voice AI just leveled up.
huggingface.co
nvidia/personaplex-7b-v1 Β· Hugging Face
Weβre on a journey to advance and democratize artificial intelligence through open source and open science.
π₯7π₯°2π2
Can we train LLMs from scratch using only low-rank factorized weights and still match dense performance?
Short answer: yes (with care).
New work βStabilizing Native Low-Rank LLM Pretrainingβ.
Short answer: yes (with care).
New work βStabilizing Native Low-Rank LLM Pretrainingβ.
π₯3β€2π2