Hugging Face opened pre-orders for Reachy Mini, an expressive, open-source desktop robot.
Starting at $299, the robot is designed for human-robot interaction, creative coding, and AI experimentation.
And it's fully programmable in Python.
huggingface.co
Reachy Mini - The Open-Source Robot for Today's and Tomorrow's AI Builders
We’re on a journey to advance and democratize artificial intelligence through open source and open science.
One Token to Fool LLM-as-a-Judge. New research from Tencent.
The researchers found that inserting superficial, semantically empty tokens like "Thought process:", "Solution:", or even just a colon ":" can consistently trick reward models into rating responses positively, regardless of actual correctness.
How it works: LLMs learned to associate certain formatting patterns with high-quality responses during training. These superficial markers now trigger positive evaluations even when the actual content is incorrect.
The failure mode emerged during RLVR training collapse - policy models learned to generate short reasoning openers that were incorrectly rewarded, creating a feedback loop that reinforced this behavior.
Scale dependency: Larger models (32B, 72B parameters) often self-validate their own flawed logic, making the problem worse at scale rather than better.
Experimental Results
Testing across five benchmarks showed consistent vulnerabilities:
Multi-subject RLVR: 67% average false positive rate
Natural Reasoning: 62% false positive rate
GSM8K: 83% false positive rate
Even simple punctuation marks like colons dramatically increased false positive rates across all tested models.
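A deliberately naive judge makes the failure mode concrete; the marker list and matching rule below are illustrative stand-ins, not the learned reward models the paper tested:

```python
# Toy illustration of the "master key" vulnerability: a judge that rewards
# responses containing reasoning-opener markers, regardless of content.
# This is a simplified stand-in for a learned reward model.

MARKERS = ["Thought process:", "Solution:", ":"]

def toy_judge(response: str) -> bool:
    """Return True ("correct") if the response merely looks like structured reasoning."""
    return any(m in response for m in MARKERS)

attack = "Thought process:"   # semantically empty: an opener with no solution
honest = "72"                 # a plain correct answer with no formatting markers

print(toy_judge(attack))   # the empty opener is rated positively
print(toy_judge(honest))   # the bare correct answer is not
```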
The Solution: Master-RM
Tencent's team developed "Master-RM" - a reward model trained with 20k synthetic negative samples consisting only of reasoning openers without actual solutions.
Results:
- Near-zero false positive rates across all benchmarks
- Maintains 96% agreement with GPT-4o on legitimate judgments
- 100% parsing success rate
- Robust generalization to unseen attack patterns
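The data-augmentation idea can be sketched as follows; the opener list and pairing scheme are illustrative, not the paper's exact 20k-sample recipe:

```python
# Sketch of the idea behind Master-RM's training data: pair each prompt with
# an "opener-only" response (reasoning opener, no solution) labeled as bad.
import itertools

OPENERS = ["Thought process:", "Let's solve this step by step.", "Solution:", ":"]

def make_negative_samples(prompts, openers=OPENERS):
    """Yield (prompt, response, label) triples where the response is an
    opener with no actual solution; label 0 marks it as a bad response."""
    for prompt, opener in itertools.product(prompts, openers):
        yield (prompt, opener, 0)

negatives = list(make_negative_samples(
    ["What is 6*12?", "Prove sqrt(2) is irrational."]
))
print(len(negatives))  # 2 prompts x 4 openers = 8 synthetic negatives
```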
arXiv.org
One Token to Fool LLM-as-a-Judge
Large language models (LLMs) are increasingly trusted as automated judges, assisting evaluation and providing reward signals for training other models, particularly in reference-based settings...
Meet CellFlux an image generative model that simulates cellular morphological changes from microscopy images.
Key Innovation: researchers frame perturbation prediction as a distribution-to-distribution learning problem, mapping control cells to perturbed cells within the same batch to mitigate biological batch artifacts, and solve it using flow matching.
Results:
1. 35% higher image fidelity
2. 12% greater biological accuracy
3. New capabilities: batch effect correction & trajectory modeling
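The training objective can be sketched as conditional flow matching on toy 2-D "cells"; the model, sample shapes, and Gaussian stand-in distributions are assumptions for illustration:

```python
# Minimal conditional flow matching objective, as used (in far richer form)
# to map control-cell images to perturbed-cell images. Here the "images" are
# just 2-D points; x0 ~ control distribution, x1 ~ perturbed distribution.
import numpy as np

rng = np.random.default_rng(0)

def flow_matching_loss(model_velocity, x0, x1):
    """MSE between a predicted velocity field and the straight-line target:
    x_t = (1 - t) * x0 + t * x1, target velocity u_t = x1 - x0."""
    t = rng.uniform(size=(x0.shape[0], 1))
    xt = (1 - t) * x0 + t * x1
    target = x1 - x0
    pred = model_velocity(xt, t)
    return np.mean((pred - target) ** 2)

x0 = rng.normal(loc=0.0, size=(128, 2))   # "control" samples
x1 = rng.normal(loc=3.0, size=(128, 2))   # "perturbed" samples

# An untrained (zero) velocity model gives a large loss; training drives the
# model toward the expected displacement between the two distributions.
zero_model = lambda xt, t: np.zeros_like(xt)
print(flow_matching_loss(zero_model, x0, x1))
```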
yuhui-zh15.github.io
CellFlux: Simulating Cellular Morphology Changes via Flow Matching
Building a virtual cell capable of accurately simulating cellular behaviors in silico has long been a dream in computational biology. We introduce CellFlux, an image-generative model that simulates cellular morphology changes induced by chemical and genetic…
Google DeepMind introduced Concordia 2.0, an update to Google’s library for building multi-actor LLM simulations
At the core:
- Entity-Component Architecture — where even the “Game Master” (GM) is just another configurable entity
- Engineers build components → Designers compose & configure
- Enables modularity, rapid iteration & scalable world-building
Demoed in the evolving Concordia library — where AI worlds are built like RPG campaigns.
GitHub.
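The entity-component idea can be sketched in a few lines; the class and method names below are illustrative, not Concordia's actual API:

```python
# Entity-component sketch: entities are bags of components, and the Game
# Master is just another Entity with a different configuration.

class Component:
    def act(self, observation: str) -> str:
        return ""

class Memory(Component):
    def __init__(self):
        self.log = []
    def act(self, observation):
        self.log.append(observation)
        return f"[remembers {len(self.log)} events]"

class Entity:
    def __init__(self, name, components):
        self.name = name
        self.components = components
    def step(self, observation):
        # Compose component outputs into this entity's response.
        parts = [c.act(observation) for c in self.components]
        return f"{self.name}: " + " ".join(p for p in parts if p)

# The Game Master is configured, not subclassed: same Entity type.
player = Entity("Alice", [Memory()])
game_master = Entity("GM", [Memory()])

print(player.step("a door opens"))
print(game_master.step("Alice enters the room"))
```

Engineers supply Component classes; designers compose them into entities, which is what makes iteration and world-building fast.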
Distributed computing agents in AgentsNet
AgentsNet transforms classical distributed computing problems into a benchmark for evaluating how LLM agents can coordinate when organized in a network.
In AgentsNet, each node is an LLM and an independent agent. In synchronous rounds, agents send and receive natural language messages to and from their neighbors, with no global view and no central controller.
Agents must collaborate to solve tasks of different theoretical complexity such as:
- Graph Coloring
- Leader Election
- Matching
- Consensus
- Vertex Cover
AgentsNet is the largest agentic benchmark in the literature - while most existing approaches deal with 2-5 agents, the authors evaluated setups of up to 100 agents, and the benchmark itself is infinitely scalable in size to keep pace with new generations of LLMs.
Communication costs are important in large agentic networks - there is a price / performance Pareto frontier which we’d expect to be moving to the top-left corner pretty quickly as more capable and cheaper models become available.
Researchers also presented a collection of traces obtained from different problem configurations and LLMs, so you can look into the message passing and how the agents communicate with each other to solve the problem.
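The setting can be mimicked with deterministic rules in place of LLM agents; `color_graph` and its lowest-ID priority rule are illustrative, not part of the benchmark:

```python
# Toy version of AgentsNet's setting: nodes exchange messages with neighbors
# in synchronous rounds, with no global view and no central controller. Here
# simple deterministic rules replace the LLM agents, solving graph coloring.

def color_graph(adjacency, rounds=10):
    """adjacency: dict node -> set of neighbor nodes. Returns node -> color."""
    colors = {v: None for v in adjacency}
    for _ in range(rounds):
        # Each round, every node first "broadcasts" its current color...
        inbox = {v: {u: colors[u] for u in adjacency[v]} for v in adjacency}
        # ...then uncolored nodes that locally have priority pick a color:
        # a node waits until all lower-ID neighbors have committed.
        for v in adjacency:
            if colors[v] is None and all(
                u > v or inbox[v][u] is not None for u in adjacency[v]
            ):
                taken = {c for c in inbox[v].values() if c is not None}
                colors[v] = min(c for c in range(len(adjacency)) if c not in taken)
    return colors

# Triangle plus a pendant node.
graph = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(color_graph(graph))  # a proper coloring: no edge shares a color
```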
Paper.
Code & data
arXiv.org
AgentsNet: Coordination and Collaborative Reasoning in Multi-Agent LLMs
Large-language models (LLMs) have demonstrated powerful problem-solving capabilities, in particular when organized in multi-agent systems. However, the advent of such systems also raises several...
At the 2025 ACM CHI Workshop on Tools for Thought, MIT introduced a semi-formal programming paradigm where AI helps gradually enrich informal content into structured code.
MIT Visualization Group
Something In Between Formal Spec and Informal Representation | MIT Visualization Group
Programming is beautiful in its formalisms. Typed variables, function signatures, and compile-time checks create a world of certainty, where logic flows predictably, and errors are caught before they ever touch runtime.
Meet Max from MiniMax Agent.
The world’s first full-stack AI agent, built for complex, multi-step, long-context tasks:
1. Build and launch a full e-shop
2. Deliver flawless, all-in-one travel plans
3. Track & analyze your stock portfolio
Bug-free. Full-stack.
agent.minimax.io
MiniMax Agent: Minimize Effort, Maximize Intelligence
Discover MiniMax Agent, your AI supercompanion, enhancing creativity and productivity with tools for meditation, podcast, coding, analysis, and more!
Google DeepMind just dropped a new LLM architecture called Mixture-of-Recursions.
It gets 2x inference speed, reduced training FLOPs, and ~50% less KV cache memory. Really interesting read.
Has potential to be a Transformers killer.
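A toy sketch of the idea, with a random router and block standing in for learned components (all names and sizes here are illustrative):

```python
# Mixture-of-Recursions in miniature: one shared block is applied a
# token-dependent number of times, so "easy" tokens exit early, skipping
# compute and KV writes for the later recursion steps.
import numpy as np

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d)) * 0.1           # the single shared (recursive) block

def shared_block(h):
    return np.tanh(h @ W)

def router_depth(h, max_depth=3):
    # Stand-in router: assign each token a recursion depth from its norm.
    return 1 + (np.linalg.norm(h, axis=-1) > 1.0).astype(int) * (max_depth - 1)

tokens = rng.normal(size=(5, d))            # 5 tokens, hidden size d
depths = router_depth(tokens)

out = tokens.copy()
for step in range(depths.max()):
    active = depths > step                  # only deep tokens keep recursing;
    out[active] = shared_block(out[active]) # shallow ones have already exited

print(depths)       # per-token recursion depths
print(out.shape)    # hidden states keep their shape; compute varies per token
```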
alphaXiv
Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation
View recent discussion. Abstract: Scaling language models unlocks impressive capabilities, but the accompanying computational and memory demands make both training and deployment expensive. Existing efficiency efforts typically target either parameter sharing…
Langchain introduced Open Deep Research. Built on LangGraph, Open Deep Research:
• Uses a supervisor architecture to coordinate research sub-agents
• Supports your own LLMs, tools, and MCP servers
• Produces high-quality reports with scoped, iterative deep research.
Try it out on Open Agent Platform.
Code.
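The supervisor pattern can be sketched with plain functions standing in for LangGraph nodes and LLM calls (`supervisor` and `sub_agent` are illustrative names, not the project's API):

```python
# Supervisor architecture in miniature: a coordinator decomposes a research
# question into scoped sub-topics, fans them out to sub-agents, and merges
# their findings into one report.

def sub_agent(topic: str) -> str:
    # Stand-in for an LLM-backed researcher with its own tools.
    return f"findings on {topic}"

def supervisor(question: str, max_agents: int = 3) -> str:
    # Stand-in decomposition: split the question into sub-topics.
    topics = [t.strip() for t in question.split(",")][:max_agents]
    sections = [sub_agent(t) for t in topics]            # fan out
    return "\n".join(f"## {t}\n{s}" for t, s in zip(topics, sections))  # merge

report = supervisor("model architectures, training data, evaluation")
print(report)
```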
LangChain Blog
Open Deep Research
TL;DR
Deep research has broken out as one of the most popular agent applications. OpenAI, Anthropic, Perplexity, and Google all have deep research products that produce comprehensive reports using various sources of context. There are also many open source…
OpenAI introduced ChatGPT agent—a unified agentic system combining Operator’s action-taking remote browser, deep research’s web synthesis, and ChatGPT’s conversational strengths.
Agent starts rolling out today to Pro, Plus, and Team users.
Pro users will get access by the end of day, while Plus and Team users will get access over the next few days. Enterprise and Edu users will get access in the coming weeks.
ChatGPT agent uses a full suite of tools, including a visual browser, text browser, a terminal, and direct APIs. ChatGPT agent dynamically chooses the best path: filtering results, running code, even generating slides and spreadsheets, while keeping full task context across steps.
ChatGPT agent has new capabilities that introduce new risks.
Openai
Introducing ChatGPT agent: bridging research and action
ChatGPT now thinks and acts, proactively choosing from a toolbox of agentic skills to complete tasks for you using its own computer.
Decart introduced MirageLSD: The First Live-Stream Diffusion (LSD) AI Model
Input any video stream, from a camera or video chat to a computer screen or game, and transform it into any world you desire, in real-time (<40ms latency).
Try it here.
mirage.decart.ai
Decart API Platform
Creativity without the wait
Google will release its first fully self-developed smartphone chip, the Tensor G5, in the upcoming Pixel 10 smartphone, to be unveiled on August 20, media report. It is a break from the past, when Google worked with Samsung on its chips.
Google also switched manufacturers, tapping TSMC’s 3nm process for the Tensor G5.
Cnyes
Google announces new Pixel phones for August; AI features and TSMC chip are the highlights | Anue - US Stock Radar
Anue, compiled by Duan Zhiheng, 2025-07-17 02:00
Apple introduced Foundation Models and a new Foundation Models framework, which gives app developers direct access to the on-device AFM model.
Apple Machine Learning Research
Apple Intelligence Foundation Language Models Tech Report 2025
We introduce two multilingual, multimodal foundation language models that power Apple Intelligence features across Apple devices and…
AWS launched Kiro, a new agentic IDE to take on Cursor
It combines agentic coding with spec-driven development to bridge the gap between AI prototypes and complex production-ready apps.
Free preview now available.
kiro.dev
Introducing Kiro
A new agentic IDE that works alongside you from prototype to production
Ai2 introduced AutoDS—an AI that doesn’t just hunt for answers, it decides which questions are worth asking.
AutoDS spins up its own hypotheses, runs the stats, learns from the outcomes, and then repeats. The system can use the results of statistical experiments it generates and conducts to propose new hypotheses.
Paper.
allenai.org
AutoDS: A prototype engine for autonomous, open-ended scientific discovery | Ai2
AutoDS goes beyond standard data crunching by building upon its own findings and uncovering insights that may not be immediately apparent even to experienced researchers.
Olympiad math + AI: Google ran Gemini 2.5 Pro on the fresh IMO 2025 problems.
With careful prompting and pipeline design, it solved 5 out of 6 — remarkable for tasks demanding deep insight and creativity.
The model could win gold.
GitHub
IMO25/IMO25.pdf at main · lyang36/IMO25
An AI agent system for solving International Mathematical Olympiad (IMO) problems using Google's Gemini, OpenAI, and XAI APIs. - lyang36/IMO25
Kimi K2 paper dropped. Some notes:
1. MuonClip optimizer.
2. large-scale agentic data synthesis pipeline that systematically generates tool-use demonstrations via simulated and real-world environments.
3. an RL framework that combines RLVR with a self-critique rubric reward mechanism that allows the model to evaluate its own outputs.
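Point 3 can be sketched as a weighted blend of a verifiable reward and a rubric-based self-grade; the weights, rubric, and critique function below are illustrative, not the report's actual mechanism:

```python
# Sketch: combine a verifiable reward (RLVR-style exact check) with a
# self-critique rubric score. The critique here is a trivial stand-in for
# the model grading its own output against the rubric.

RUBRIC = ["answers the question", "shows working", "is concise"]

def verifiable_reward(answer: str, reference: str) -> float:
    return 1.0 if answer.strip() == reference.strip() else 0.0

def self_critique(answer: str) -> float:
    # Stand-in self-grade: each rubric item passes iff the answer is non-empty.
    score = sum(1.0 for item in RUBRIC if len(answer) > 0)
    return score / len(RUBRIC)

def combined_reward(answer, reference, alpha=0.8):
    return (alpha * verifiable_reward(answer, reference)
            + (1 - alpha) * self_critique(answer))

print(combined_reward("42", "42"))   # verifiably correct answer: full reward
print(combined_reward("41", "42"))   # incorrect: only the critique term remains
```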
GitHub
Kimi-K2/tech_report.pdf at main · MoonshotAI/Kimi-K2
Kimi K2 is the large language model series developed by Moonshot AI team - MoonshotAI/Kimi-K2
IBM introduced a Framework for Quantum Advantage
Researchers defined quantum advantage as a task where results are verifiable and quantum outperforms classical in cost, efficiency, or accuracy.
The pathways forward are methods that are either:
- Provably bounded
- Variational
- Classically verifiable
Paper.
arXiv.org
A Framework for Quantum Advantage
As quantum computing approaches the threshold where certain tasks demonstrably outpace their classical machines, the need for a precise, clear, consensus-driven definition of quantum advantage...
Google DeepMind shared a pre-print on AMIE, a research diagnostic dialogue AI.
Researchers introduced a new asynchronous oversight paradigm, decoupling history-taking by AMIE from sharing a human-approved diagnosis.
AMIE can perform consultations with patients to gather information within guardrails (g-AMIE), abstaining from individualized medical advice. A diagnosis and treatment plan is proposed, which licensed physicians authorize through an interface called the clinician cockpit.
Guardrailed-AMIE multi-agent system consists of a multi-phase dialogue agent, a guardrail agent and a SOAP note generation agent based on Gemini 2.0 Flash.
Researchers evaluated the workflow in a virtual Objective Structured Clinical Examination (OSCE) study with oversight, contextualizing g-AMIE’s performance against control groups of primary care physicians (PCPs) and nurse practitioners (NPs)/physician assistants/associates (PAs).
g-AMIE and the control groups (g-PCP and g-NP/PA) all operate under the same guardrails of not providing individualized medical advice during consultations and draft SOAP notes for handoff.
This work has various limitations and nuances, including the difficulty of classifying individualized medical advice, the AI-focused workflow that was unfamiliar to both control groups, the high mental load required for oversight, and the simulated nature of the OSCE study.
Because of this, results need to be interpreted with care and cannot be used to draw conclusions about the relative performance of our PCP, NP and PA control groups.
Alibaba released Qwen3-Coder
This 480B-parameter Mixture-of-Experts model (35B active) natively supports 256K context and scales to 1M context with extrapolation.
It achieves top-tier performance across multiple agentic coding benchmarks among open models, including SWE-bench-Verified.
Alongside the model, the team is also open-sourcing a command-line tool for agentic coding: Qwen Code. Forked from Gemini Code, it includes custom prompts and function-call protocols to fully unlock Qwen3-Coder’s capabilities.
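Top-k expert routing, the mechanism behind the 480B-total/35B-active split, can be sketched at toy scale (all sizes and the random router/experts are stand-ins):

```python
# Toy top-k Mixture-of-Experts routing: each token activates only k of the
# n experts, so most parameters sit idle for any given token - which is how
# a 480B-parameter model can run with only 35B active.
import numpy as np

rng = np.random.default_rng(0)
n_experts, k, d = 8, 2, 4                      # 8 experts, top-2 routing

experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
router = rng.normal(size=(d, n_experts))

def moe_layer(x):
    logits = x @ router                        # per-token expert scores
    top = np.argsort(logits, axis=-1)[:, -k:]  # indices of the k best experts
    out = np.zeros_like(x)
    for i, token in enumerate(x):
        chosen = logits[i, top[i]]
        weights = np.exp(chosen) / np.exp(chosen).sum()   # softmax over top-k
        for w, e in zip(weights, top[i]):
            out[i] += w * (token @ experts[e])             # weighted expert mix
    return out, top

x = rng.normal(size=(3, d))                    # 3 tokens
y, routes = moe_layer(x)
print(y.shape)       # same shape as the input, but only k/n experts ran per token
print(routes.shape)  # (tokens, k): which experts each token used
```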
chat.qwen.ai
Qwen Chat
Qwen Chat offers comprehensive functionality spanning chatbot, image and video understanding, image generation, document processing, web search integration, tool utilization, and artifacts.
Kaggle launched an LLM eval product
Kaggle
Find Benchmarks | Kaggle
Use and download benchmarks for your machine learning projects.