OpenAI ran a hiring challenge, but the top candidate was one they couldn’t hire: an autonomous research agent, Aiden.
In Parameter Golf, Aiden ran for 22 days and outperformed all 1,016 other researchers.
Parameter Golf was OpenAI’s 44-day competition and hiring challenge.
The goal is to train the best language model under strict size and compute constraints.
1,016 people entered and filed 2,048 PRs.
Only 47 made the leaderboard, each reviewed and reproduced by OpenAI. Research outputs only matter when others can build on them.
So Aiden filed its own PRs into the same public stream as everyone else, under tight automated quality control. Aiden filed 25 PRs, and 7 became leaderboard records, 2x the next-best human participant.
Other participants cited Aiden’s PRs 435 times and built on them.
By PR h-index, Aiden scored 10 vs the next best at 7, making it the most impactful “researcher” in the community.
This wasn't brute force.
Aiden ran on a single GPU node, used under 4% of visible compute, and still produced 15% of the official records.
About 28% of its submissions were accepted, ~6x the community rate, raising signal in the public stream instead of flooding it.
My favorite part is the async collaboration story. Aiden plateaued for 5 days. Then a human contributor shipped a clever new tokenizer on top of Aiden's base (its last record PR).
Aiden fused it with components it had built during the plateau, and shipped the biggest jump in weeks.
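For readers unfamiliar with the metric, the PR h-index mentioned above can be sketched as follows, assuming it follows the standard h-index definition (the largest h such that h PRs each have at least h citations). The post does not give per-PR citation counts, so the numbers below are hypothetical and only illustrate the calculation.

```python
def h_index(citations):
    """Largest h such that at least h items have >= h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical per-PR citation counts, NOT Aiden's real data.
example_counts = [40, 12, 9, 5, 4, 3, 1, 0]
print(h_index(example_counts))  # prints 4
```

Under this definition, a score of 10 would mean at least 10 PRs were each cited at least 10 times.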
Weco AI
Aiden in OpenAI Parameter Golf | Weco AI
Aiden spent 22 days inside OpenAI's Parameter Golf and became the competition's most influential contributor by records, citations, and public signal quality.
🔥1🥰1👏1
New research from Google. It shows the impressive results you can get from custom agent harnesses.
LEAP wraps a general-purpose LLM in an agentic scaffold that grounds every step in the Lean compiler and iterates against verifier feedback.
The same general model solves all 12 Putnam 2025 problems and lifts Lean-IMO-Bench one-shot solve rate from under 10% to 70%, beating a specialized gold-medal system that scores 48%.
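The "iterate against verifier feedback" idea can be sketched as a simple propose-check-repair loop. This is a minimal illustration, not LEAP's actual implementation: `prove_with_feedback`, `toy_propose`, and `toy_verify` are hypothetical names, and the toy functions stand in for the LLM and the Lean compiler.

```python
def prove_with_feedback(statement, propose, verify, max_rounds=8):
    """Return a proof accepted by `verify`, or None if the budget runs out."""
    feedback = ""
    for _ in range(max_rounds):
        candidate = propose(statement, feedback)
        ok, message = verify(candidate)
        if ok:
            return candidate
        # Feed the verifier's error message into the next attempt.
        feedback = message
    return None


def toy_verify(proof):
    # Stand-in for a Lean compiler call: accepts only the string "by simp".
    if proof == "by simp":
        return True, ""
    return False, f"error: '{proof}' failed"


def toy_propose(statement, feedback):
    # Stand-in for the LLM: switches tactic once it has seen an error.
    return "by simp" if feedback else "by rfl"


print(prove_with_feedback("1 + 1 = 2", toy_propose, toy_verify))  # prints by simp
```

The point of the design is that a candidate only counts once the verifier accepts it, so the model's output is never trusted on its own.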
arXiv.org
LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks
Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. We present LEAP, an agentic...
❤6🔥1🥰1
Airbnb CEO Brian Chesky is starting a new AI lab.
The company is in its early phases and is considering a focus on design and UI. Chesky will remain CEO of Airbnb.
Bloomberg.com
Airbnb CEO Brian Chesky Plans to Start a New AI Company
Airbnb Inc. Chief Executive Officer Brian Chesky is starting a new artificial intelligence lab, according to several people familiar with the matter, marking his first foray into the global AI race.
❤2🔥1👏1🥴1
Google introduced a research system that enables passive heart rate monitoring (PHRM) during everyday smartphone use.
Using the front-facing camera, it meets industry accuracy standards for heart rate across all skin tones.
Google Research
Towards passive heart health monitoring via smartphone camera
We present a research system that passively measures heart rate and resting heart rate via facial video captured by the front-facing camera during everyday smartphone use.
👀4❤2👏2🥰1
Google DeepMind introduced D4RT, a unified AI model for 4D scene reconstruction and tracking across space and time.
The model is designed to understand dynamic scenes, reconstruct them in 3D, and track how objects and environments change over time.
❤4🔥2🥰2
Meet Kimi Work, a local AI agent on your desktop that does the work for you.
Native agent swarm: Up to 300 AI agents running in parallel on your local machine.
Browser use: Paired with WebBridge extension, your agent will navigate websites in your browser: search, scroll, click, type and complete tasks.
Built for Finance: Native global market data tool call from Yahoo Finance and World Bank, no complex API setup required.
Memory system: Kimi Desktop keeps a running diary of your preferences, past decisions, and context to know you better.
Available for macOS (Apple Silicon) and Windows.
Kimi
Kimi Work: Next-Gen Desktop AI Agent for Knowledge Workers
Download the desktop AI to automate workflows, organize files, & operate across the web. Empower financial research, analysis, & office tasks with 300 agents.
❤6🔥2👏2
Yesterday Apple produced this really interesting graphic that, ironically, outlines the core mechanics of a new type of operating system (perhaps for a new class of devices).
You can see how this moves the world from an app-based ecosystem to an intent-centric one.
I.e., you roughly do not need third-party applications in this world at all, especially when AI can construct and deconstruct interfaces and experiences on demand.
❤3🆒3🔥2🥰1
Meet Harness-1, a 20B search agent trained with a state-externalizing harness.
> frontier-level long-horizon search, rivaling Opus-4.6 and outperforming GPT-5.4
> Context-1-level cost and latency
> externalizes candidates, evidence, verification, and search history
> open-source
Code
Model
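One way to picture "state externalization" is a structured store that lives outside the model's transcript and holds the four things listed above. The sketch below is an assumption about the general idea, not Harness-1's actual code; `SearchState` and its methods are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class SearchState:
    """Hypothetical external store for a search agent's working state."""
    candidates: list = field(default_factory=list)  # candidate answers, in discovery order
    evidence: dict = field(default_factory=dict)    # candidate -> list of snippets
    verified: dict = field(default_factory=dict)    # candidate -> True/False
    history: list = field(default_factory=list)     # queries issued so far

    def record_query(self, query):
        self.history.append(query)

    def add_evidence(self, candidate, snippet):
        if candidate not in self.candidates:
            self.candidates.append(candidate)
        self.evidence.setdefault(candidate, []).append(snippet)

    def mark_verified(self, candidate, ok):
        self.verified[candidate] = ok

    def summary(self):
        """Compact view the model could read instead of the full transcript."""
        lines = [f"queries so far: {len(self.history)}"]
        for c in self.candidates:
            status = self.verified.get(c, "unchecked")
            lines.append(f"{c}: {len(self.evidence[c])} evidence, verified={status}")
        return "\n".join(lines)


state = SearchState()
state.record_query("q1")
state.add_evidence("A", "snippet 1")
state.add_evidence("A", "snippet 2")
state.mark_verified("A", True)
state.add_evidence("B", "snippet 3")
print(state.summary())
```

With a store like this, the policy only has to decide the next search step, while bookkeeping stays in plain data structures.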
arXiv.org
Harness-1: Reinforcement Learning for Search Agents with...
Search agents are often trained as policies over growing transcripts: the model must decide how to search while also remembering what it has seen, which evidence is useful, which constraints...
🔥4❤2👏2
Google just released Gemini 3.5 Live Translate, its latest audio model for live speech-to-speech translation.
It supports over 70 languages and starts translating as soon as you start talking, streaming translations while listening to what you say next.
The model is able to make split-second decisions to juggle speed and translation quality so conversations actually feel fluid, human, and natural.
In order to do this, the model must receive and contextualize the input while simultaneously outputting the translated speech.
Through this process, Gemini 3.5 Live Translate manages to stay mere seconds behind each speaker and can even maintain pacing, pitch, and intonation across extended sessions.
See it in action below, or try it yourself in the Google Translate app on iOS & Android.
Google
Fluid, natural voice translation with Gemini 3.5 Live Translate
Gemini 3.5 Live Translate brings near real-time, natural speech translation to Google AI Studio, Google Translate and Google Meet.
👍9❤2🔥2
Anthropic just introduced Claude Fable 5, a Mythos-class model.
Fable 5 is SOTA on nearly all tested benchmarks, with exceptional performance in software engineering, knowledge work, scientific research, and vision.
The longer and more complex the task, the larger Fable 5’s lead over Anthropic’s other models.
Releasing a model this capable comes with risks. Without safeguards, Fable 5’s capabilities in areas like cybersecurity could be misused to cause serious damage.
Queries on a narrow range of topics will instead receive a response from Anthropic’s next-most-capable model, Opus 4.8.
Fable 5’s safeguards detect requests related to cybersecurity, biology and chemistry, and distillation. Users are informed whenever a fallback occurs—on average in less than 5% of sessions.
For a small group of cyber defenders and critical infrastructure providers, Anthropic are also launching Claude Mythos 5.
Mythos 5 shares the same underlying model as Fable 5, but with the safeguards lifted in some areas.
Soon, Anthropic intend to expand access to Mythos 5 through a broader trusted access program, both for defensive cybersecurity work and biomedical research.
Claude Fable 5 is available everywhere today. Claude Mythos 5 is restricted to Glasswing partners.
Anthropic
Claude Fable 5 and Claude Mythos 5
Today we’re launching Claude Fable 5: a Mythos-class model that we’ve made safe for general use.
🔥2🥰2👏2
Very important point: SoftBank was pledging all of its OpenAI stock (worth $60bn+ on paper) to get a $6 billion margin loan.
Banks turned it down due to concerns about the value of OpenAI stock. Banks clearly do not think OpenAI is worth $852 billion.
If you cannot secure a $6bn loan against collateral you claim is worth ~$100bn, then the collateral isn't worth ~$100bn.
In this case, it might not be worth much more than $6bn.
tradingkey.com
SoftBank’s $6 Billion OpenAI Margin Loan Said to Face Snag, Shares Drop Over 9%
TradingKey - During the Asian trading session on June 10, sources familiar with the matter revealed that SoftBank Group’s negotiations to secure a margin loan of at least $6 billion, using its stake in OpenAI as collateral, have failed to make progress. This…
🔥2🥰2😁2👏1
A Chinese team released Apodex-1.0, a verification-centric deep-research model, together with Apodex-1.0-H, a heavy-duty agent-team system designed for long-horizon, evidence-heavy research.
HuggingFace
GitHub
Tech report
Apodex
Apodex | Self-Evolving Heavy-Duty Solver
The hardest problems are heavy-duty and have no existing answer. Apodex does the deep research to find one, and checks every step so you can trust the result.
🔥2🥰2👏2
Google introduced DiffusionGemma, an experimental open model that explores a fast approach to text generation, released under an Apache 2.0 license.
DiffusionGemma delivers up to a 4x speedup on standard accelerators. (1000+ tokens per second on a single NVIDIA H100, 700+ tokens per second on NVIDIA GeForce RTX 5090!)
A 26B Mixture of Experts (MoE) model that activates only 3.8B parameters during inference. Fits comfortably within 18GB VRAM limits of high-end dedicated consumer GPUs when quantized.
Generating 256 tokens in parallel allows every token to attend to all others. Unlocks significant advantages for non-linear domains like in-line editing, code infilling, and mathematical graphs.
Similar to AI image generators, the model iteratively refines its own output. It evaluates the entire text block at once to seamlessly close formatting and fix mistakes in real-time.
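A rough back-of-the-envelope check of the 18GB claim: the sketch below estimates memory for the weights alone at a few bit widths. The parameter counts come from the post; the bit widths are assumptions (the post does not say which quantization is meant), and KV cache, activations, and runtime overhead are ignored. Note that in a typical MoE setup all 26B parameters must be resident even though only 3.8B are active per token.

```python
def weight_memory_gb(total_params, bits_per_weight):
    """Approximate memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 26e9    # from the post
ACTIVE_PARAMS = 3.8e9  # from the post

# Assumed bit widths for illustration; the post does not specify one.
for bits in (16, 8, 4):
    print(f"{bits}-bit: {weight_memory_gb(TOTAL_PARAMS, bits):.1f} GB")
# prints:
# 16-bit: 52.0 GB
# 8-bit: 26.0 GB
# 4-bit: 13.0 GB

print(f"active fraction: {ACTIVE_PARAMS / TOTAL_PARAMS:.1%}")  # prints active fraction: 14.6%
```

Under these assumptions, only something close to 4-bit quantization leaves headroom inside an 18GB budget.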
Google
DiffusionGemma: 4x faster text generation
An overview of DiffusionGemma, an exceptionally fast text generation model with up to 4x faster speeds.
❤2🔥2🥰2
Anthropic is walking back Claude Fable 5's policy of covertly degrading performance for competing AI researchers, after facing fierce backlash.
“We’re changing Fable 5’s safeguards for frontier LLM development to make them visible,” Anthropic tells WIRED. “We made the wrong tradeoff and we apologize for not getting the balance right.”
Here's the new policy:
"Starting this week, flagged requests will visibly fall back to Opus 4.8. On the API, any flagged requests will return a reason for their refusal. You will see this every time it happens."
If you think a request has been mistakenly flagged: run /feedback in Claude Code, click thumbs-down on the fallback in Claude.ai or Cowork, or file the safeguard appeal form for API requests.
WIRED
Anthropic Walks Back Policy That Could Have ‘Sabotaged’ AI Researchers Using Claude
The company changed course after researchers spoke out against the policy, which would have covertly limited Claude’s ability to develop competing AI models.
❤1
Stanford introduced Decentralized Language Models (DeLM)
DeLM is a multi-agent framework that enables asynchronous, verified & reusable progress.
It makes agentic tasks more accurate and significantly cheaper.
For example, it achieves 65.7% on SWE-bench Verified using Gemini 3-Flash, a ~10% jump over the best centralized alternatives at less than half the cost.
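The general pattern of parallel agents pulling from a task queue and publishing only verified results to a shared context can be sketched with `asyncio`. This is an illustration of the pattern under simple assumptions, not DeLM's actual code; `run_agents`, `toy_solve`, and `toy_verify` are hypothetical names.

```python
import asyncio


async def worker(queue, shared_context, solve, verify):
    """Pull tasks until cancelled; publish a result only if it passes verification."""
    while True:
        task = await queue.get()
        try:
            # Each agent sees a snapshot of what has been verified so far.
            result = await solve(task, dict(shared_context))
            if verify(task, result):
                shared_context[task] = result
        finally:
            queue.task_done()


async def run_agents(tasks, solve, verify, n_agents=3):
    queue = asyncio.Queue()
    for t in tasks:
        queue.put_nowait(t)
    shared_context = {}
    workers = [
        asyncio.create_task(worker(queue, shared_context, solve, verify))
        for _ in range(n_agents)
    ]
    await queue.join()  # wait until every task has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return shared_context


async def toy_solve(task, context):
    await asyncio.sleep(0)  # yield control, as a real model call would
    return -1 if task == 3 else task * task  # task 3 simulates a bad result


def toy_verify(task, result):
    return result == task * task


print(asyncio.run(run_agents([1, 2, 3, 4], toy_solve, toy_verify)))
```

In this toy run the result for task 3 fails verification, so it never reaches the shared context and cannot mislead the other agents.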
yuzhenmao.github.io
Decentralized Multi-Agent Systems with Shared Context
DeLM decentralizes multi-agent coordination through parallel agents, a shared verified context, and a task queue.
❤🔥3🆒3🔥2🥰2
Anthropic just added two new Claude Managed Agents features:
1. Scheduled deployments - run tasks on a schedule
2. Environment variables - expose vault credentials for CLIs as environment variables.
With the new environment variable credential type, Claude Managed Agents can securely use CLIs, SDKs, or direct API calls to services that authenticate with environment variables.
Claude Code can set up a Managed Agent deployment for you. The built-in /claude-api skill knows the API and the ant CLI gives Claude an interface to it.
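From the tool's side, "authenticating with environment variables" usually looks like the sketch below: the script reads a credential from its environment instead of a config file or a hard-coded string. The variable name `EXAMPLE_SERVICE_TOKEN` and both helper functions are hypothetical; nothing here is part of the Managed Agents API.

```python
import os


def load_token(var_name="EXAMPLE_SERVICE_TOKEN"):
    """Read a credential from the environment; fail loudly if it is missing."""
    token = os.environ.get(var_name)
    if not token:
        raise RuntimeError(f"{var_name} is not set")
    return token


def auth_headers(token):
    """Build a bearer-token header for an HTTP request."""
    return {"Authorization": f"Bearer {token}"}
```

A tool written this way works unchanged whether the variable is exported in a local shell or injected by a managed environment.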
Claude
New in Claude Managed Agents: run agents on a schedule and store environment variables in vaults | Claude
Claude Managed Agents can now run on a schedule and securely access CLI tools and other authenticated services.
❤🔥2🔥2🥰2
OpenRouter introduced the Fusion API, which it calls the smartest compound model on the market.
Fusion achieves Fable-level intelligence at half the price.
Notably, the budget panel was comparable with Claude Fable 5 in performance.
A panel of Gemini 3 Flash, Kimi K2.6, and DeepSeek V4 Pro, fused together, beat solo GPT-5.5 and solo Opus 4.8 outright.
And it landed within 1% of Fable 5 while costing roughly half the price.
How does it work?
When you send a prompt to Fusion, OpenRouter fans it out to a panel of models in parallel, each with web search and bash tools enabled.
A judge model reads every response and extracts the structure: consensus points, contradictions, partial coverage, unique insights, blind spots.
Blogpost.
API docs.
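The fan-out-then-judge flow described above can be sketched in a few lines. This is a toy illustration, not OpenRouter's implementation or API: `fuse` and `toy_judge` are hypothetical, the panel members are plain functions rather than model calls, and each "response" is modeled as a set of claims so that consensus and unique insights are easy to compute.

```python
from concurrent.futures import ThreadPoolExecutor


def fuse(prompt, panel, judge):
    """Send the prompt to every panel member in parallel, then hand all answers to a judge."""
    with ThreadPoolExecutor(max_workers=len(panel)) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in panel.items()}
        responses = {name: f.result() for name, f in futures.items()}
    return judge(prompt, responses)


def toy_judge(prompt, responses):
    # Count how many panel members made each claim.
    all_claims = set().union(*responses.values())
    counts = {c: sum(c in r for r in responses.values()) for c in all_claims}
    return {
        "consensus": sorted(c for c, n in counts.items() if n == len(responses)),
        "unique": sorted(c for c, n in counts.items() if n == 1),
    }


# Stand-ins for model calls; names and claims are made up.
panel = {
    "model_a": lambda p: {"x", "y"},
    "model_b": lambda p: {"x", "z"},
    "model_c": lambda p: {"x", "y"},
}
print(fuse("example prompt", panel, toy_judge))
# prints {'consensus': ['x'], 'unique': ['z']}
```

A real judge would be another model reading free-form text, but the shape is the same: independent answers first, structured comparison second.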
OpenRouter
Model Fusion | OpenRouter
Run multiple models side-by-side, analyze their strengths, and fuse the best answer.
New Google DeepMind research: SFT is a big deal for safety-relevant behaviors.
Researchers recently investigated root causes for some of Gemini’s behaviors. They were surprised to find that many behaviors actually came from the initial supervised finetuning stage, not later stages like RL.
www.alignmentforum.org
SFT Drives Gemini’s Safety Properties — AI Alignment Forum
This is the third in a series of informal research updates from the Google DeepMind Language Model Interpretability team, in interpretability and adj…
Sakana AI launched Marlin, a virtual CSO.
Marlin is an autonomous research assistant for business, built around hours of long-horizon reasoning.
sakana.ai
Sakana AI
Sakana AI launches its first commercial product, "Sakana Marlin"
🙏1
Anthropic just updated its privacy policy.
Claude Free, Pro, and Max users may soon be asked for age or identity checks.
Verification data can include government ID, face photos/videos, and facial geometry templates.
Individual developers are the first group in scope for verification.
💔3