Harmonic, the AI math startup from Robinhood's founder, shared how its Aristotle system earned a gold medal at IMO 2025, the elite math contest.
Only four teams have achieved this.
Unlike OpenAI and DeepMind, Harmonic's Aristotle uses formal Lean-based search methods plus a dedicated geometry solver, similar to ByteDance's SeedProver.
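To give a flavor of what "formal Lean-based" means: every proof step is checked by the Lean kernel, so a completed proof cannot be wrong. A toy Lean 4 example of a machine-checkable statement (using the core `Nat.add_comm` lemma):

```lean
-- A trivial theorem verified by the Lean kernel: addition on the
-- naturals is commutative. Systems like Aristotle search for proofs
-- of far harder statements, but the guarantee is the same.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```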
alphaXiv
Aristotle: IMO-level Automated Theorem Proving | alphaXiv
View recent discussion. Abstract: We introduce Aristotle, an AI system that combines formal verification with informal reasoning, achieving gold-medal-equivalent performance on the 2025 International Mathematical Olympiad problems. Aristotle integrates three…
Google introduced CodeMender: a new AI agent that uses Gemini Deep Think to automatically patch critical software vulnerabilities.
It checks whether its patches are functionally correct, fix the root cause, and don't break anything else. This ensures that only high-quality solutions reach human reviewers.
CodeMender has already created and submitted 72 high-quality fixes for serious security issues in major open-source projects.
It can instantly patch new flaws as well as rewrite old code to eliminate entire classes of vulnerabilities – saving developers significant time.
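The validation gate described above can be sketched as a simple pipeline. This is a hypothetical illustration, not Google's actual implementation; the check names and patch fields are assumptions:

```python
# Hypothetical patch-validation gate: a candidate fix is escalated to
# human review only if it passes every automated check.

def validate_patch(patch, checks):
    """Run each named check against the patch; return (ok, failed_names)."""
    failures = [name for name, check in checks.items() if not check(patch)]
    return (not failures, failures)

# Illustrative checks mirroring the three criteria in the announcement:
# correctness, root-cause fix, and no collateral breakage.
checks = {
    "reproducer_no_longer_triggers": lambda p: p["fixes_root_cause"],
    "regression_suite_passes": lambda p: p["tests_pass"],
    "no_new_static_findings": lambda p: p["lint_clean"],
}

ok, failed = validate_patch(
    {"fixes_root_cause": True, "tests_pass": True, "lint_clean": True},
    checks,
)
```

A patch failing any single check would be held back rather than sent to a human.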
Google DeepMind
Introducing CodeMender: an AI agent for code security
CodeMender is a new AI-powered agent that improves code security automatically. It instantly patches new software vulnerabilities, and rewrites and secures existing code, eliminating entire...
Live from OpenAI’s DevDay
YouTube
OpenAI DevDay 2025: Opening Keynote with Sam Altman
Sam Altman kicks off DevDay 2025 with a keynote to explore ideas that will challenge how you think about building. Join us for announcements, live demos, and a vision of how developers are reshaping the future with AI.
OpenAI introduced AgentKit: build high-quality agents for any vertical with a visual builder, evals, guardrails, and other tools.
A live demo showed a working agent built in 8 minutes.
The gap between open and closed models is narrowing, and this trend looks set to continue.
As foundation models become commoditized globally, the most interesting directions, from both a research and a commercial standpoint, lie not in developing them but in finding new ways to use them.
On the Terminal-Bench Hard evaluation for agentic coding and terminal use, open-weights models such as DeepSeek V3.2 Exp, Kimi K2 0905, and GLM-4.6 have made large strides, with DeepSeek surpassing Gemini 2.5 Pro.
These advances reflect significantly higher capability for use in coding and other agent use cases, and developers have a wider range of model options than ever for these applications.
Fantastic paper “Evolution Strategies at Scale: LLM Fine-Tuning Beyond Reinforcement Learning”
RL has long been the dominant method for fine-tuning, powering many state-of-the-art LLMs.
Methods like PPO and GRPO explore in action space. But can we instead explore directly in parameter space?
YES. Researchers propose a scalable framework for full-parameter fine-tuning using Evolution Strategies (ES).
By skipping gradients and optimizing directly in parameter space, ES achieves more accurate, efficient, and stable fine-tuning.
GitHub.
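The core idea, exploring in parameter space rather than action space, can be shown on a toy objective. A minimal vanilla-ES sketch (not the paper's code; the objective here stands in for an RL-style reward):

```python
import random

random.seed(0)

def es_finetune(fitness, theta, sigma=0.1, lr=0.05, pop=50, iters=200):
    """Vanilla Evolution Strategies: perturb parameters with Gaussian
    noise, weight each perturbation by its baseline-subtracted reward,
    and step along the resulting estimate. No gradients of `fitness`
    are ever computed."""
    n = len(theta)
    for _ in range(iters):
        noises = [[random.gauss(0, 1) for _ in range(n)] for _ in range(pop)]
        rewards = [fitness([t + sigma * e for t, e in zip(theta, eps)])
                   for eps in noises]
        baseline = sum(rewards) / pop
        # Monte Carlo estimate of the reward gradient w.r.t. parameters.
        grad = [sum((r - baseline) * eps[i] for r, eps in zip(rewards, noises))
                / (pop * sigma) for i in range(n)]
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

# Toy objective standing in for an RL reward: optimum at (3, -1).
best = es_finetune(lambda p: -(p[0] - 3) ** 2 - (p[1] + 1) ** 2, [0.0, 0.0])
```

At LLM scale the same loop runs over billions of parameters, which is where the paper's engineering contribution lies.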
arXiv.org
Evolution Strategies at Scale: LLM Fine-Tuning Beyond...
Fine-tuning pre-trained large language models (LLMs) for down-stream tasks is a critical step in the AI deployment pipeline. Reinforcement learning (RL) is arguably the most prominent fine-tuning...
OpenAI DevDay 2025.
Highlights:
OpenAI grew from 2 million weekly developers and 100 million weekly ChatGPT users in 2023 to 4 million developers and 800M+ weekly ChatGPT users in 2025.
The platform now processes over 6 billion tokens per minute on the API, up from 300 million tokens per minute in 2023.
Apps inside ChatGPT
- OpenAI launched the Apps SDK in preview, built on Model Context Protocol, enabling developers to build real apps inside ChatGPT that are interactive, adaptive, and personalized. Docs.
- Launch partners include Booking, Canva, Coursera, Expedia, Figma, Spotify, and Zillow, with apps available today to all logged-in ChatGPT users outside of the EU on Free, Go, Plus and Pro plans
- OpenAI will support many ways to monetize including the new Agentic Commerce Protocol that offers instant checkout right inside ChatGPT
- Later this year, OpenAI will begin accepting app submissions for review and publication, launch a dedicated directory where users can browse and search for apps, and launch apps to ChatGPT Business, Enterprise and Edu (OpenAI expects to bring apps to EU users soon).
Building agents
- AgentKit includes Agent Builder (visual canvas for creating multi-agent workflows with drag-and-drop nodes, available in beta),
- ChatKit (toolkit for embedding customizable chat-based agent experiences, generally available starting today)
- expanded Evals capabilities (datasets, trace grading, automated prompt optimization, third-party model support)
- Connector Registry (beginning beta rollout to some API, ChatGPT Enterprise and Edu customers with a Global Admin Console) consolidates data sources into a single admin panel across ChatGPT and the API, including pre-built connectors like Dropbox, Google Drive, SharePoint, Microsoft Teams, and third-party MCP servers
- Guardrails is an open-source, modular safety layer that helps protect agents against unintended or malicious behavior, available to mask or flag PII, detect jailbreaks, and apply other safeguards
Writing code
- Codex is officially out of research preview and into general availability with new Slack integration, Codex SDK, and admin tools including environment controls, monitoring, and analytics dashboards
- Starting October 20, Codex cloud tasks will begin counting towards usage limits (Plus: 30-150 local messages or 5-40 cloud tasks every 5 hours, Pro: 300-1,500 local messages or 50-400 cloud tasks every 5 hours, with code review not counting toward limits for a limited time).
API updates
- gpt-5-pro (gpt-5-pro-2025-10-06) is now available in the API ($15 per 1M input tokens, $120 per 1M output tokens) for tasks in domains like finance, legal, and healthcare where you need high accuracy and depth of reasoning
- gpt-realtime-mini (gpt-realtime-mini-2025-10-06 - $0.60 per 1M text input tokens, $2.40 per 1M text output tokens, $10 per 1M audio input tokens, $20 per 1M audio output tokens) is 70% cheaper than the advanced voice model with the same voice quality and expressiveness
- gpt-audio-mini (gpt-audio-mini-2025-10-06 - $0.60 per 1M text input tokens, $2.40 per 1M text output tokens, $10 per 1M audio input tokens, $20 per 1M audio output tokens) provides cost-efficient audio processing
- sora-2 ($0.10 per second for 720x1280 or 1280x720) and sora-2-pro ($0.30 per second for 720x1280 or 1280x720, $0.50 per second for 1024x1792 or 1792x1024) are available in preview in the API with the ability to pair sound with visuals including rich soundscapes, ambient audio, and synchronized effects, plus control over video length, aspect ratio, resolution, and the ability to easily remix videos
- gpt-image-1-mini ($2 per 1M text input tokens, $2.50 per 1M image input tokens, $8 per 1M image output tokens, $0.005-$0.015 per image depending on quality and size) is 80% less expensive than the large model
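Per-request cost under the quoted token prices is simple arithmetic. A small helper, defaulting to the gpt-5-pro rates above ($15 / 1M input, $120 / 1M output):

```python
# Cost calculator using the per-1M-token prices quoted in the post.
# Swap the defaults for any of the other models' rates.

def request_cost(input_tokens, output_tokens,
                 in_per_m=15.0, out_per_m=120.0):
    """Dollar cost of one API request at the given per-1M-token prices."""
    return (input_tokens * in_per_m + output_tokens * out_per_m) / 1_000_000

# 10k input + 2k output on gpt-5-pro: $0.15 + $0.24 = $0.39.
cost = request_cost(10_000, 2_000)
```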
OpenAI
OpenAI DevDay 2025
Explore all the announcements from OpenAI DevDay 2025, including apps in ChatGPT, AgentKit, Sora 2, and more. Access blogs, docs, and resources to help you build with the latest tools.
Excel Add-in with Claude AI integration
Take actions in Excel - Build financial models, Analyze customer behavior, Transform messy data.
Now available for Max plan users.
pivot.claude.ai
Claude Excel Add-in
Excel Add-in with Claude AI integration
Google expanded access to Opal, its no-code AI mini-app builder, to 15 new countries.
Also launched new features like advanced debugging and a faster building experience.
Google
Expanding access to Opal, our no-code AI mini-app builder
We’re bringing Opal to 15 new countries and making it even easier to build.
Wow! Researchers introduced a new RL algorithm to train agents that can build other agents.
Weak-for-Strong (W4S): Training a Weak Meta-Agent to Harness Strong Executors.
With this, SLMs become powerful meta-agents that manage frontier LLMs in diverse agentic tasks.
Code.
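The weak-for-strong pattern can be sketched as a propose-execute-evaluate loop. All names here are illustrative assumptions, not the paper's code; in the real system the "weak" model is an RL-trained SLM and the "strong" executor is a frontier LLM:

```python
# Illustrative W4S-style loop: a weak meta-agent proposes a workflow,
# a strong executor runs it, and the observed score becomes feedback
# for the meta-agent's next proposal.

def weak_for_strong(meta_propose, strong_execute, evaluate, task, rounds=3):
    best_workflow, best_score = None, float("-inf")
    feedback = None
    for _ in range(rounds):
        workflow = meta_propose(task, feedback)   # small model writes the plan
        result = strong_execute(workflow, task)   # frontier model executes it
        score = evaluate(result, task)
        feedback = (workflow, score)
        if score > best_score:
            best_workflow, best_score = workflow, score
    return best_workflow, best_score

# Toy run with stub components: the meta-agent refines its plan once
# it has any feedback at all.
wf, s = weak_for_strong(
    meta_propose=lambda t, fb: "plan-v2-refined" if fb else "plan-v1",
    strong_execute=lambda w, t: len(w),
    evaluate=lambda r, t: r,
    task="demo",
)
```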
arXiv.org
Weak-for-Strong: Training Weak Meta-Agent to Harness Strong Executors
Efficiently leveraging the capabilities of contemporary large language models (LLMs) is increasingly challenging, particularly when direct fine-tuning is expensive and often impractical....
Anthropic is preparing Claude Code for release in its mobile app.
Claude Code now runs on Anthropic's infrastructure, no longer just via GitHub.
Users will be able to connect the Claude app to GitHub and run their coding prompts on the go.
TestingCatalog
Anthropic prepares Claude Code release for mobile apps
Anthropic prepares a Code section on web and mobile with GitHub integration, repository browsing, and Claude Code tasks tailored to developers.
Google introduced Gemini 2.5 Computer Use
- Controls UIs using visual understanding and reasoning
- Works for web and Android control
- Try it with Browserbase or locally
Google
Introducing the Gemini 2.5 Computer Use model
Today we are releasing the Gemini 2.5 Computer Use model via the API, which outperforms leading alternatives at browser and mobile tasks.
MIT Media Lab introduced NeuroChat: neuroadaptive chatbot that adapts its responses to your cognitive engagement.
NeuroChat is the first to use generative AI to adapt on the fly.
Every response — tone, depth, pacing — is co-authored by your brain and the model.
By reading brain signals in real time, NeuroChat personalizes its teaching style to your attention, curiosity, and focus.
Here’s how it works:
NeuroChat measures real-time brain activity using EEG - a lightweight, noninvasive sensor that captures your level of engagement while you learn.
The chatbot uses this engagement score to adjust how it teaches - simplifying, deepening, or changing pace to match your focus.
A live feedback loop between your mind and the model.
NeuroChat isn’t mind-reading.
It tracks only engagement signals, not specific thoughts, memories, or emotions.
All processing happens in your browser, and it’s compatible with local AI models, keeping brain data private.
Researchers ran a pilot study (n = 24) comparing NeuroChat to a non-adaptive chatbot:
1. EEG engagement: significantly higher with NeuroChat
2. Self-reports: users described it as more human-like, fluid, and enjoyable
3. Learning: similar short-term scores, but stronger sustained focus and curiosity
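The feedback loop described above can be sketched in a few lines. This is a hypothetical illustration, not the MIT implementation; the engagement formula (a classic EEG band-power ratio) and the thresholds are assumptions:

```python
# Sketch of a NeuroChat-style adaptation loop: an EEG-derived engagement
# index steers the instruction sent to the language model.

def engagement_index(alpha, beta, theta):
    """Classic EEG engagement ratio beta / (alpha + theta);
    higher values indicate more focused attention."""
    return beta / (alpha + theta)

def adapt_style(score, low=0.4, high=0.8):
    """Map an engagement score onto a teaching-style instruction."""
    if score < low:
        return "Simplify, slow down, and use a concrete example."
    if score > high:
        return "Go deeper: add nuance and pose a challenge question."
    return "Keep the current pace and level of detail."

# Example band powers giving moderate engagement (6 / 14 ~= 0.43):
style = adapt_style(engagement_index(alpha=10.0, beta=6.0, theta=4.0))
```

In the real system the instruction would be folded into the model's prompt on every turn, closing the brain-to-model loop.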
Tally Forms
NeuroChat Public Release Waitlist
Made with Tally, the simplest way to create forms.
Google shared a new paper that provides an extremely sample-efficient way to create an agent that performs well in multi-agent, partially observed, symbolic environments.
The key idea is to use LLM-powered code synthesis to learn a code world model (in the form of Python code) from a small dataset of (observation, action) trajectories, plus some background information (in text form), and then to pass this induced WM, plus the observation history, to an existing solver, such as (information-set) MCTS, to choose the next action.
Researchers applied their method to various two-player games (with both perfect and imperfect information), and show that it works much better than prompting the LLM to directly generate actions, especially for novel games.
In particular, the method beats Gemini 2.5 Pro in 7 of 10 games and ties it in 2 of 10.
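The induce-then-solve pipeline can be shown on a toy game. Everything here is an illustrative stand-in (a greedy rollout instead of information-set MCTS, stub functions instead of LLM-synthesized code):

```python
# Sketch of the code-world-model loop: a synthesized Python transition
# function is validated against recorded trajectories, then handed to
# a search procedure for action selection.

def validate_world_model(step_fn, trajectories):
    """Check that the transition function reproduces every recorded
    (state, action) -> next_state triple."""
    return all(step_fn(s, a) == s2 for s, a, s2 in trajectories)

def greedy_plan(step_fn, score_fn, state, actions, depth=2):
    """Stand-in for MCTS: pick the action whose best rollout scores highest."""
    def rollout(s, d):
        if d == 0:
            return score_fn(s)
        return max(rollout(step_fn(s, a), d - 1) for a in actions)
    return max(actions, key=lambda a: rollout(step_fn(state, a), depth - 1))

# Toy game: state is a counter; "inc" adds 1, "dec" subtracts 1; higher is better.
step = lambda s, a: s + 1 if a == "inc" else s - 1
traj = [(0, "inc", 1), (1, "dec", 0)]
ok = validate_world_model(step, traj)
best = greedy_plan(step, lambda s: s, 0, ["inc", "dec"])
```

In the paper, `step` is Python code written by the LLM from a handful of trajectories plus a text description, and the solver is a proper (information-set) MCTS.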
arXiv.org
Code World Models for General Game Playing
Large Language Models (LLMs) reasoning abilities are increasingly being applied to classical board and card games, but the dominant approach -- involving prompting for direct move generation --...
Anthropic just now introduced Claude Code Plugins in public beta.
Plugins allow you to install and share curated collections of slash commands, agents, MCP servers, and hooks directly within Claude Code.
To get started, you can add a marketplace using: /plugin marketplace add user-or-org/repo-name.
Then browse and install from the /plugin menu.
Try out the multi-agent workflow we use to develop Claude Code:
/plugin marketplace add anthropics/claude-code
/plugin install feature-dev
Anyone can host a marketplace or make a plugin. All you need is a git repo with a .claude-plugin/marketplace.json file.
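A minimal `.claude-plugin/marketplace.json` might look like the sketch below. The field names are assumptions inferred from the announcement; check the official plugin docs for the exact schema:

```json
{
  "name": "my-marketplace",
  "owner": { "name": "your-org" },
  "plugins": [
    {
      "name": "feature-dev",
      "source": "./plugins/feature-dev",
      "description": "Multi-agent feature development workflow"
    }
  ]
}
```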
Claude
Customize Claude Code with plugins | Claude
Claude Code now supports plugins: custom collections of slash commands, agents, MCP servers, and hooks that install with a single command. Share your Claude Code setup with plugins Slash commands, agents, MCP servers, and hooks are all extension points you…
Google introduced Gemini Enterprise with Agents, Connectors and Agent Builder
“It allows you to chat with your company’s documents, data and apps as well as build and deploy AI agents, all grounded in your information and context.”
Google Cloud Blog
Introducing Gemini Enterprise | Google Cloud Blog
Today, we’re introducing Gemini Enterprise – the new front door for AI in the workplace. It’s our advanced agentic platform that brings the best of Google AI to every employee, for every workflow.
A new paper, “Barbarians at the Gate: How AI is Upending Systems Research,” introduces AI-Driven Research for Systems (ADRS).
Researchers show how AI-Driven Research for Systems (ADRS) can rediscover or outperform human-designed algorithms across cloud scheduling, MoE expert load balancing, LLM-SQL optimization, transaction scheduling, and more — all within hours and under $20.
Code.
GitHub
GitHub - UCB-ADRS/ADRS: AI-Driven Research Systems (ADRS)
AI-Driven Research Systems (ADRS) . Contribute to UCB-ADRS/ADRS development by creating an account on GitHub.
Elon Musk’s xAI is developing world models: AI systems capable of understanding and designing physical environments.
To achieve this, xAI has hired former Nvidia specialists.
These world models are considered a more advanced form of AI than the LLMs trained primarily on text data, and are expected to surpass the limitations of popular AI tools such as ChatGPT and xAI’s own Grok.
xAI plans to apply world models first in the gaming sector.
The models could be used to automatically generate interactive 3D environments and potentially be applied to AI systems for robotics.
Some tech companies expect world models to become a next-generation core technology that could extend AI applications beyond software into physical products, such as humanoid robots.
Last month, Nvidia told the Financial Times that the market potential of world models could be nearly as large as the entire global economy.
Musk also reaffirmed his earlier goal by posting on X that xAI will release a “great AI-generated game” by the end of next year.
Financial Times
Elon Musk’s xAI joins race to build ‘world models’ to power video games
Artificial intelligence group hired staff from Nvidia to work on advanced AI that can design and navigate physical spaces