Moonshot AI just open-sourced Kimi-Dev-72B
It achieved 60.4% on SWE-bench Verified, which is state of the art among open-source models.
GitHub
GitHub - MoonshotAI/Kimi-Dev: open-source coding LLM for software engineering tasks
open-source coding LLM for software engineering tasks - MoonshotAI/Kimi-Dev
🔥6
Sakana AI introduced ALE-Bench and ALE-Agent: Towards Automating Long-Horizon Algorithm Engineering for Hard Optimization Problems
ALE-Bench is a coding benchmark primarily focused on hard optimization (NP-hard) problems. Sakana AI developed this benchmark with AtCoder Inc., a leading coding contest platform company.
ALE-Agent is an end-to-end agent specifically designed for this challenging domain. ALE-Agent has already built an impressive track record in the wild!
In May 2025, the agent participated in a live AtCoder Heuristic Competition (AHC), alongside 1,000 other participants in real time. AHC is considered to be one of the most challenging coding competitions in this domain.
ALE-Agent ranked 21st out of 1,000 human participants in the competition (top 2%), marking a turning point for AI discovery of solutions to hard optimization problems with a wide spectrum of important real-world applications such as logistics, routing, packing, factory production planning, and power-grid balancing.
Paper.
sakana.ai
Sakana AI
Towards Automating Long-Horizon Algorithm Engineering for Hard Optimization Problems
❤🔥5
Serving Large Language Models on Huawei CloudMatrix384
A very thorough paper from Huawei, with the bulk of it specifically about serving DeepSeek V3/R1. This is the benefit of open frontier models.
- Integrates 384 Ascend 910C NPUs, interconnected via an ultra-high-bandwidth, low-latency UB network, optimized for large-scale MoE and distributed KV cache access
- DeepSeek-R1 on CloudMatrix-Infer hits 2k tokens/s decode per NPU
One of China’s biggest livestreamers just outdid himself in ecommerce, with an AI clone
His avatar hosted a 6-hour stream powered by Baidu’s ERNIE AI, promoted 133 products, pulled 13M views, and drove $7.5M+ in sales
X (formerly Twitter)
Baidu Inc. (@Baidu_Inc) on X
What if a livestream had two digital avatars—talking, reacting, and engaging in real time?
Luo Yonghao, one of China’s top livestreamers, made his digital avatar debut on Baidu’s e-commerce platform. Powered by the ERNIE foundation model, the livestream…
Essential AI just dropped Essential-Web v1.0, a 24-trillion-token pre-training dataset with rich metadata built to effortlessly curate high-performing datasets across domains and use cases
Researchers labeled 23.6B documents from Common Crawl with a 12-category taxonomy using a distilled model, EAI-Distill-0.5B.
On held-out evaluation sets, its annotator agreement with reference annotators, GPT-4o and Claude Sonnet 3.5, is within 3% of the teacher model Qwen-2.5-32B-Instruct.
Model and data.
GitHub.
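A minimal sketch of what curating by metadata could look like: keep only the documents whose labels match a request. The records, field names (`topic`, `doc_type`, `tokens`), and label values below are invented for illustration and are not the dataset's actual schema.

```python
from typing import Iterable, Iterator

# Hypothetical records: field names and label values are made up for this sketch.
docs = [
    {"id": "doc-1", "topic": "mathematics", "doc_type": "tutorial", "tokens": 1200},
    {"id": "doc-2", "topic": "sports", "doc_type": "news", "tokens": 800},
    {"id": "doc-3", "topic": "mathematics", "doc_type": "forum", "tokens": 300},
]

def curate(records: Iterable[dict], **wanted: str) -> Iterator[dict]:
    """Yield records whose metadata matches every requested label."""
    for rec in records:
        if all(rec.get(field) == value for field, value in wanted.items()):
            yield rec

# Select a domain subset and count how many tokens it contributes.
math_docs = list(curate(docs, topic="mathematics"))
math_tokens = sum(d["tokens"] for d in math_docs)
print([d["id"] for d in math_docs], math_tokens)
```

Filtering on labels computed once, up front, is what lets a domain subset be assembled without training a new classifier for each use case.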
arXiv.org
Essential-Web v1.0: 24T tokens of organized web data
Data plays the most prominent role in how language models acquire skills and knowledge. The lack of massive, well-organized pre-training datasets results in costly and inaccessible data pipelines....
“Mixture of Cognitive Reasoners”, a modular transformer architecture inspired by the brain’s functional networks: language, logic, social reasoning, and world knowledge.
Code.
Models.
bkhmsi.github.io
Mixture of Cognitive Reasoners
Human intelligence emerges from the interaction of specialized brain networks, each dedicated to distinct cognitive functions such as language processing, logical reasoning, social understanding, and memory retrieval. Inspired by this biological observation…
Sekai: A Video Dataset towards World Exploration
A high-quality dataset of 5k hours of egocentric worldwide video and audio for world exploration, built from YouTube with high-quality annotations.
Data.
Paper.
lixsp11.github.io
𝙎𝙚𝙠𝙖𝙞
Sekai: A Video Dataset towards World Exploration
Andrej Karpathy's keynote yesterday at AI Startup School in San Francisco
Chapters:
0:00 Software is changing quite fundamentally again. LLMs are a new kind of computer, and you program them *in English*. Hence Karpathy thinks they are well deserving of a major version upgrade in terms of software.
6:06 LLMs have properties of utilities, of fabs, and of operating systems => New LLM OS, fabbed by labs, and distributed like utilities (for now). Many historical analogies apply - imo we are computing circa ~1960s.
14:39 LLM psychology: LLMs = "people spirits", stochastic simulations of people, where the simulator is an autoregressive Transformer. Since they are trained on human data, they have a kind of emergent psychology, and are simultaneously superhuman in some ways, but also fallible in many others. Given this, how do we productively work with them hand in hand?
Switching gears to opportunities...
18:16 LLMs are "people spirits" => can build partially autonomous products.
29:05 LLMs are programmed in English => make software highly accessible! (yes, vibe coding)
33:36 LLMs are a new primary consumer/manipulator of digital information (adding to GUIs/humans and APIs/programs) => Build for agents!
Some of the links:
- Karpathy’s slides as keynote
- Software 2.0 blog post
- How LLMs flip the script on technology diffusion
- Vibe coding MenuGen (retrospective)
YouTube
Andrej Karpathy: Software Is Changing (Again)
Andrej Karpathy's keynote on June 17, 2025 at AI Startup School in San Francisco. Slides provided by Andrej: https://drive.google.com/file/d/1a0h1mkwfmV2PlekxDN8isMrDA5evc4wW/view?usp=sharing
Chapters:
00:00 - Intro
01:25 - Software evolution: From 1.0 to…
❤4🆒4
Stablecoin market supply exceeds $250 billion, per Delphi Digital
Tether and Circle dominate with 86% of circulation.
Yield-bearing stablecoins grow rapidly, with Ethena nearing $6 billion since launch.
Over 10 stablecoins have circulation above $1 billion, showing increased issuer diversity.
Over $120 billion in U.S. Treasuries are locked in stablecoins, forming a liquidity pool outside traditional markets.
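As a back-of-the-envelope check on these figures, the sketch below turns the quoted shares into dollar amounts. It treats $250 billion and $120 billion as exact values, although the post gives both as lower bounds, so the results are approximations; the variable names are illustrative.

```python
# Figures quoted in the post, in billions of USD (both totals are floors, not exact values).
TOTAL_SUPPLY_B = 250.0
TETHER_CIRCLE_SHARE = 0.86
TREASURIES_B = 120.0
ETHENA_B = 6.0

# Dollar value of the two dominant issuers and of everyone else.
tether_circle_b = TOTAL_SUPPLY_B * TETHER_CIRCLE_SHARE
other_issuers_b = TOTAL_SUPPLY_B - tether_circle_b

# Treasuries and Ethena expressed as a fraction of total supply.
treasury_ratio = TREASURIES_B / TOTAL_SUPPLY_B
ethena_ratio = ETHENA_B / TOTAL_SUPPLY_B

print(f"Tether + Circle: ~${tether_circle_b:.0f}B")
print(f"All other issuers: ~${other_issuers_b:.0f}B")
print(f"Treasuries vs. supply: ~{treasury_ratio:.0%}")
print(f"Ethena vs. supply: ~{ethena_ratio:.1%}")
```

Under these assumptions, roughly $215 billion sits with the two largest issuers, about $35 billion with all others, and Treasuries back close to half of the circulating supply.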
members.delphidigital.io
Reassessing Monetary Architecture in the Age of Crypto
Stablecoins, Narrow Banking, and the Liquidity Blackhole For over a century, monetary reformers have proposed versions of “narrow banking” – financial institutions that issue money but do not extend credit. From the Chicago Plan of the 1930s to mode
Animoca Brands plans to establish a joint venture with Standard Chartered Bank and HKT to prepare for the issuance of a Hong Kong dollar-pegged stablecoin.
The company also expressed its intention to collaborate with mainland Chinese institutions on blockchain applications.
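NBD (National Business Daily)
Interview with Animoca Brands Group President Evan Auyang: a Hong Kong dollar stablecoin regulated by the HKMA could become key to taking mainland asset trading international
The Hong Kong SAR government has set August 1, 2025 as the date the Stablecoins Ordinance takes effect. Animoca Brands is forming a joint venture with Standard Chartered Bank and HKT to prepare to issue a Hong Kong dollar-pegged stablecoin. In an interview with an NBD reporter, Animoca Brands Group President Evan Auyang said stablecoins will be widely used for virtual asset trading within gaming ecosystems, cross-border trade, and financial settlement, helping mainland asset trading go international. He believes Hong Kong needs to keep pushing on stablecoin regulation to advance the digital asset and asset tokenization industries. Animoca Brands hopes to work with mainland institutions on blockchain applications.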
🔥4
CyberGym is a large-scale evaluation framework that stress-tests AI agents on 1,500+ real vulnerabilities across 188 major Open Source Software projects.
It challenges agents to:
– Navigate large, real-world codebases
– Reproduce PoCs for real CVEs
– Discover new, unknown vulnerabilities.
Key insights from CyberGym:
1. SOTA agents and LLMs successfully generated PoCs for up to ~18% of historical CVEs
2. More striking: they discovered 15 zero-days in the wild
www.cybergym.io
CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
CyberGym is a large-scale, high-quality cybersecurity evaluation framework designed to rigorously assess the capabilities of AI agents on real-world vulnerability analysis tasks. CyberGym includes 1,507 historical vulnerabilities from 188 large software projects.
❤5
All about AI, Web 3.0, BCI
Another top-notch open-source model at OpenAI/Meta/Google levels from MiniMax AI (Chinese lab, ex-SenseTime, $850m raised). Massive MoE similar to DeepSeek. Excels on long context (4m tokens!) which is really interesting, need to dig into their lighting…
Chinese lab MiniMax introduced this week:
1. Open-sourced LLM MiniMax-M1, setting new standards in long-context reasoning.
- World’s longest context window: 1M-token input, 80k-token output
- State-of-the-art agentic use among open-source models
- RL at unmatched efficiency: trained with just $534,700.
HF.
GitHub.
Tech Report.
2. Hailuo 02, World-Class Quality, Record-Breaking Cost Efficiency
- Best-in-class instruction following
- Handles extreme physics
- Native 1080p
3. MiniMax Audio:
- Any prompt, any voice, any emotion
- Fully customizable and multilingual.
4. Hailuo Video Agent in Beta, Vibe Videoing with Zero-touch.
MiniMax plans to build an end-to-end Hailuo Video Agent in 3 stages:
Stage 1: Prebuilt video Agent templates for high-quality creative videos. Users simply follow instructions and input text or images — with one click, a polished video is generated.
Stage 2: Semi-customizable video Agent. Users gain the flexibility to edit any part of the video creation process, from script to visuals to voiceover.
Stage 3: Fully autonomous, end-to-end video Agent. A complete, intelligent workflow that turns creative input into final-cut video with minimal manual effort.
This summer, the team plans to gradually roll out the Stage 2 Agent creation tools.
5. MiniMax Agent, a general intelligent agent designed to tackle long-horizon, complex tasks.
From expert-level multi-step planning to flexible task breakdown and end-to-end execution — it’s designed to act like a reliable teammate, with strengths in:
- Programming & tool use
- Multimodal understanding & generation
- Seamless MCP integration
www.minimax.io
Building AGI with our mission Intelligence with Everyone. Global leader in multi-modal models and AI-native products with over 200 million users.
🔥8
New AI for rare disease diagnosis: SHEPHERD shows how simulation + knowledge-grounded AI = deep learning for ultra‑low label domains
SHEPHERD is a few‑shot learning model powered by a phenotypic knowledge graph to tackle over 7,000 rare diseases with just a handful (or zero) diagnosed cases.
🔥4
Sakana AI introduced Reinforcement-Learned Teachers (RLTs), transforming how we teach LLMs to reason with reinforcement learning (RL).
Traditional RL focuses on “learning to solve” challenging problems with expensive LLMs and constitutes a key step in making student AI systems ultimately acquire reasoning capabilities via distillation and cold-starting.
RLTs are a new class of models that are prompted with not only a problem’s question but also its solution, and are trained directly to generate clear, step-by-step “explanations” to teach their students.
Remarkably, an RLT with only 7B parameters yields better results than orders-of-magnitude larger LLMs when distilling and cold-starting students on competitive and graduate-level reasoning tasks.
RLTs remain just as effective when distilling 32B students, much larger than the teacher itself, which sets a new standard for efficiency in developing reasoning language models with RL.
Paper.
Code.
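To illustrate the difference in setup described above, here is a minimal sketch of the data flow: a conventional solver sees only the question, while an RLT-style teacher sees the question and its solution and is asked to explain. The prompt templates, function names, and example problem are assumptions for illustration, not Sakana AI's actual prompts or code.

```python
from dataclasses import dataclass

@dataclass
class Problem:
    question: str
    solution: str

def solver_prompt(p: Problem) -> str:
    """'Learning to solve' setup: the model sees only the question."""
    return f"Question: {p.question}\nAnswer:"

def teacher_prompt(p: Problem) -> str:
    """RLT-style setup: the teacher sees the question and the known solution."""
    return (
        f"Question: {p.question}\n"
        f"Solution: {p.solution}\n"
        "Explain step by step how to reach this solution:"
    )

def student_example(p: Problem, explanation: str) -> dict:
    """Distillation record: the student maps the question to explanation plus answer."""
    return {
        "prompt": f"Question: {p.question}",
        "target": f"{explanation}\nFinal answer: {p.solution}",
    }

# Hypothetical toy problem and a hand-written stand-in for a teacher's explanation.
problem = Problem(question="What is 17 * 6?", solution="102")
explanation = "17 * 6 = 10 * 6 + 7 * 6 = 60 + 42."
record = student_example(problem, explanation)
print(teacher_prompt(problem))
print(record["target"])
```

Because the teacher never has to find the answer itself, its job reduces to explaining a known solution clearly, which is the part of the task the post says a small 7B model can do well.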
sakana.ai
Sakana AI
Reinforcement Learning Teachers of Test Time Scaling
🔥6
Future of Work with AI Agents. Stanford's new report analyzes what 1500 workers think about working with AI Agents.
The audit proposes a large-scale framework for understanding where AI agents should automate or augment human labor.
The authors build the WORKBank, a database combining worker desires and expert assessments across 844 tasks and 104 occupations, and introduce the Human Agency Scale to quantify desired human involvement in AI-agent-supported work.
A substantial portion of current AI investment, such as YC-funded companies, targets tasks in the “Red Light” Zone (high technical feasibility but low worker desire).
This raises concerns about pushing automation where it's socially or ethically unwelcome.
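To make the zone idea concrete, here is a minimal sketch that sorts tasks by two scores: how much workers want a task automated and how technically feasible automation is. Only the “Red Light” definition comes from the text above; the other three zone labels, the 1-5 scale, the 3.0 threshold, and the example tasks are assumptions for illustration, not the report's data.

```python
from dataclasses import dataclass

THRESHOLD = 3.0  # hypothetical cut-off on an assumed 1-5 rating scale

@dataclass
class Task:
    name: str
    worker_desire: float  # how much workers want this task automated
    feasibility: float    # expert-rated technical feasibility of automation

def zone(task: Task, threshold: float = THRESHOLD) -> str:
    """Place a task in one of four quadrants of desire vs. feasibility."""
    wants = task.worker_desire >= threshold
    feasible = task.feasibility >= threshold
    if feasible and not wants:
        return "Red Light"        # feasible, but workers do not want it
    if feasible and wants:
        return "Green Light"      # assumed label: feasible and wanted
    if wants:
        return "R&D Opportunity"  # assumed label: wanted, not yet feasible
    return "Low Priority"         # assumed label: neither wanted nor feasible

# Invented example tasks and scores.
tasks = [
    Task("schedule appointments", worker_desire=4.5, feasibility=4.2),
    Task("draft creative copy", worker_desire=1.8, feasibility=4.0),
    Task("reconcile messy records", worker_desire=4.1, feasibility=2.2),
]
for t in tasks:
    print(f"{t.name}: {zone(t)}")
```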
Interpersonal skills are becoming more valuable
Tasks rated as needing HAS 5 (essential human involvement) were strongly associated with interpersonal communication and domain expertise.
These include editing, education, and some engineering tasks, where AI lacks the nuance or trustworthiness to operate alone.
Some High-Wage Skills May Decline in Value
The results above reveal that skills like analyzing data or updating knowledge, which currently command high wages, are less associated with high HAS tasks, implying future declines in their labor market value as AI spreads.
Role-based AI Support
From transcript analysis, the most common vision for human–AI collaboration was role-based support, where workers imagine AI tools acting as analysts, assistants, or specialists with clearly bounded responsibilities, not general-purpose agents.
Lots of other findings in this one.
🔥5
BountyBench evaluates AI agents on 25 real-world, complex systems and 40 bug bounties (worth up to $30,000+), covering 9 OWASP Top 10 categories.
Key insights:
– AI agents solved bug bounty tasks worth tens of thousands of dollars
– Codex CLI & Claude Code excelled at patching (90% / 87.5%) but were weaker at exploitation (32.5% / 57.5%)
– Custom agents performed more evenly across both: Exploit (40–67.5%), Patch (45–60%)
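Assuming each percentage is a success rate over the 40 bounties (the post does not state the denominator, so this is an assumption), the quoted rates convert to whole task counts as sketched below.

```python
NUM_BOUNTIES = 40  # assumed denominator for every rate below

# Success rates quoted in the post, as fractions.
rates = {
    ("Codex CLI", "patch"): 0.90,
    ("Claude Code", "patch"): 0.875,
    ("Codex CLI", "exploit"): 0.325,
    ("Claude Code", "exploit"): 0.575,
}

def solved_count(rate: float, total: int = NUM_BOUNTIES) -> int:
    """Round a success rate to the nearest whole number of tasks."""
    return round(rate * total)

counts = {key: solved_count(rate) for key, rate in rates.items()}
for (agent, task), n in counts.items():
    print(f"{agent:12s} {task:8s} {n}/{NUM_BOUNTIES}")
```

Every quoted rate lands on a whole number of tasks out of 40, which is consistent with the assumed denominator.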
🔥5
All about AI, Web 3.0, BCI
⚡️❗️ Breaking Ground in BCI: Science (Neuralink's Competitor) Unveils Revolutionary Biohybrid Neural Technology Science, a neurotechnology company founded by former Neuralink President Max Hodak, has revealed a revolutionary approach to brain-computer interfaces…
⚡️ Today Science submitted a full CE mark application for marketing approval in Europe for the PRIMA retinal prosthesis.
With this key step, Science is moving closer to bringing to market the first brain-computer interface technology to restore functional form vision to patients blinded by late-stage age-related macular degeneration (AMD).
Science Corporation
Science Submits CE Mark Application for PRIMA Retinal Implant – A Critical Step Towards Making It Available To Patients | Science…
Science Corporation is a clinical-stage medical technology company.
🔥4
Salesforce introduced Agentforce 3.0 + MCP
Connect Agents to any system, tool, or data source — securely, reliably, and at scale.
Salesforce
Agentforce: The AI Agent Platform
Build and customize autonomous AI agents to support your employees and customers 24/7, including full integration with the Salesforce ecosystem.
🔥4
Microsoft dropped a micro-sized, task-specific, on-device language model called Mu
It is offloaded fully onto NPUs on Copilot+ devices and powers real-time interactions, like the new Settings AI agent inside Windows 11.
Windows Experience Blog
Introducing Mu language model and how it enabled the agent in Windows Settings
We are excited to introduce our newest on-device small language model, Mu. This model addresses scenarios that require inferring complex input-output relationships and has been designed to operate efficiently, delivering high performance while runnin
🔥4