StepFun released GELab-Zero-4B-preview, a 4B multimodal GUI agent fine-tuned for Android.
It understands taps, swipes, typing & waits, and can perform complex, multi-app tasks.
Built on Qwen3-VL-4B-Instruct.
HuggingFace.
GitHub.
stepfun-ai/GELab-Zero-4B-preview · Hugging Face
#DeepSeek just launched DeepSeek-V3.2 & DeepSeek-V3.2-Speciale: reasoning-first models built for agents
1. DeepSeek-V3.2: Official successor to V3.2-Exp. Now live on App, Web & API.
2. DeepSeek-V3.2-Speciale: Pushing the boundaries of reasoning capabilities. API-only for now.
Thinking in Tool-Use:
- Introduced a new massive agent training data synthesis method covering 1,800+ environments & 85k+ complex instructions.
- DeepSeek-V3.2 integrates thinking directly into tool-use, and supports tool-use in both thinking and non-thinking modes.
API update:
- V3.2: Same usage pattern as V3.2-Exp.
- V3.2-Speciale: Served via a temporary endpoint: base_url="
Same pricing as V3.2, no tool calls, available until Dec 15, 2025, 15:59 (UTC).
V3.2 now supports Thinking in Tool-Use: details.
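Since V3.2 keeps the V3.2-Exp usage pattern, tool-use goes through an OpenAI-compatible chat-completions payload. A minimal sketch; the tool, its schema, and the model name are illustrative assumptions, not from the announcement:

```python
import json

def build_tool_use_request(model: str, user_prompt: str) -> dict:
    """Assemble an OpenAI-compatible chat-completions payload declaring one tool."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [{
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, for illustration only
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }],
    }

payload = build_tool_use_request("deepseek-chat", "What's the weather in Paris?")
print(json.dumps(payload, indent=2))
```

Per the post, V3.2 can emit tool calls in both thinking and non-thinking modes with the same request shape.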
Google introduced Budget Tracker for smarter AI agents
Current LLM agents waste tool-call budgets.
This work unveils Budget Tracker and BATS, enabling agents to dynamically adapt planning based on remaining resources.
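The idea can be sketched as a tracker the agent consults before each planning step; the class, modes, and thresholds below are hypothetical illustrations, not the paper's actual Budget Tracker/BATS interface:

```python
# Minimal budget-aware planning loop (hypothetical API, for illustration).
class BudgetTracker:
    def __init__(self, tool_call_budget: int):
        self.remaining = tool_call_budget

    def spend(self, calls: int = 1) -> None:
        self.remaining -= calls

    def plan_mode(self) -> str:
        """Adapt the planning strategy to what is left of the budget."""
        if self.remaining > 10:
            return "explore"    # plenty of budget: try broad searches
        elif self.remaining > 2:
            return "exploit"    # running low: commit to the best-known path
        return "finalize"       # almost out: answer with what we have

tracker = BudgetTracker(tool_call_budget=12)
tracker.spend(3)
print(tracker.plan_mode())  # prints "exploit" (12 - 3 = 9 calls left)
```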
We have a new best text-to-video model that beats Google's Veo. Runway Gen-4.5 (codenamed Whisper Thunder) scores +20 Elo over Veo 3 on preference data, the same gap as between Veo 3 and Sora 2 Pro.
Does text-to-vid, image-to-vid, keyframes. 5-10s of output. No audio.
Runway Research | Introducing Runway Gen-4.5
A new frontier for video generation. State-of-the-art motion quality, prompt adherence and visual fidelity.
Sam Altman told staff today that he was declaring a "code red" as ChatGPT faces growing threats from Google and other AI makers.
He wrote that he's marshaling more resources to improve model behavior and other features in the chatbot.
In an internal Slack memo, Sam said he's directing more employees to work on improving ChatGPT for over 800 million weekly users. Key code-red priorities include personalizing the chatbot so each person can customize how it interacts with them, improving ImageGen, improving model behavior, boosting speed and reliability, and minimizing overrefusals.
OpenAI is delaying ads (which the company is testing but hasn't publicly acknowledged, according to a person with knowledge of the plans), AI agents (which aim to automate tasks related to shopping and health), and Pulse. It also plans to release a new reasoning model next week that Sam said beats Google's Gemini 3 in OpenAI's internal tests.
The Information
OpenAI CEO Declares "Code Red" to Combat Threats to ChatGPT, Delays Ads Effort
OpenAI CEO Sam Altman on Monday told employees he was declaring a "code red" to marshal more resources to improve ChatGPT as threats rise from Google and other artificial intelligence competitors, according to an internal memo. As a result, OpenAI plans to…
The world's first Co-Scientist integrating AI and XR. Meet LabOS.
It uses multimodal perception, self-evolving agents, and XR tools to see what researchers see, grasp experimental context, and assist in real time.
From cancer immunotherapy target discovery to stem-cell engineering, it turns labs into collaborative spaces where human insight and machine smarts evolve together, proving modern science moves fastest when thought and action team up.
Paper
arXiv.org
LabOS: The AI-XR Co-Scientist That Sees and Works With Humans
Modern science advances fastest when thought meets action. LabOS represents the first AI co-scientist that unites computational reasoning with physical experimentation through multimodal...
Mistral released the Mistral 3 family of models
Small models Ministral 3 (14B, 8B, 3B), each released with base, instruct and reasoning versions.
And Mistral Large 3, a frontier-class open-source MoE. Apache 2.0.
mistral.ai
Introducing Mistral 3 | Mistral AI
A family of frontier open-source multimodal models
Shopify just shipped Tangle: the first open-source ML experimentation platform with content-based caching and a visual editor that's actually pleasant to use.
The CPU time savings alone are ridiculous (seeing 1+ year saved at Shopify).
Tangle: An open-source ML experimentation platform built for scale (2025) - Shopify
Tangle saves months of compute time, makes every experiment automatically reproducible, and allows teammates to share computation without coordination.
Diffusion Language Models are hyped lately, but hard to reproduce due to missing frameworks and high training costs.
Berkeley and UIUC show a surprisingly simple path: using their dLLM toolkit, they teach BERT to chat via discrete diffusion.
No generative pretraining, about 50 GPU hours, and ModernBERT large chat v0 reaches near Qwen1.5 0.5B quality with only lightweight SFT.
Even better, they open sourced the full training and inference pipeline plus a Hello World example, along with the extensible dllm framework. Efficient, cheap, and beginner friendly.
Models.
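The recipe above can be sketched as masked-token denoising: each diffusion timestep hides a random fraction of tokens, and the BERT-style encoder is trained to recover them. A toy illustration of the corruption step (not the dllm toolkit's actual API):

```python
import random

MASK = "[MASK]"

def corrupt(tokens, mask_ratio, rng):
    """Discrete-diffusion forward step: hide a random subset of tokens."""
    corrupted, targets = [], []
    for tok in tokens:
        if rng.random() < mask_ratio:
            corrupted.append(MASK)
            targets.append(tok)    # the model must reconstruct these
        else:
            corrupted.append(tok)
            targets.append(None)   # unmasked positions carry no loss
    return corrupted, targets

rng = random.Random(0)
# One "timestep": higher mask ratios correspond to noisier steps.
noisy, targets = corrupt("the cat sat on the mat".split(), 0.5, rng)
print(noisy)
```

Training then amounts to ordinary masked-token prediction at varying mask ratios, which is why no generative pretraining is needed: the BERT checkpoint already knows how to fill in blanks.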
GitHub
GitHub - ZHZisZZ/dllm: dLLM: Simple Diffusion Language Modeling
UK passes law officially recognizing crypto as third kind of property
The Block
Local industry body CryptoUK said this gives crypto a "clearer legal footing" in related crimes or litigation.
A promising step toward practical, efficient compute-in-memory (CIM) systems
A new memristor-based ADC with adaptive quantization shows the possibility: analog AI hardware could unlock its full potential without bulky converters in the way.
It delivers strong CIFAR10 and ImageNet performance at just 5 bits, achieves up to 15.1x better energy efficiency and 12.9x smaller area, and cuts CIM system overhead by more than half.
Nature
Memristor-based adaptive analog-to-digital conversion for efficient and accurate compute-in-memory
Nature Communications - Hong et al. report an adaptive memristor-based analog-to-digital converter which leverages the programmable nature of memristors to implement optimal, data-aware...
OpenAI published a blog post stating that confessions can keep language models honest.
Proof-of-concept method that trains models to report when they break instructions or take unintended shortcuts.
Even when models learn to cheat, theyβll still admit it...
OpenAI
How confessions can keep language models honest
OpenAI researchers are testing βconfessions,β a method that trains models to admit when they make mistakes or act undesirably, helping improve AI honesty, transparency, and trust in model outputs.
Google introduced the Massive Sound Embedding Benchmark (MSEB).
This new open-source framework evaluates universal sound understanding across 8 core tasks, from retrieval to reconstruction, in order to accelerate progress in multimodal AI.
research.google
From Waveforms to Wisdom: The New Benchmark for Auditory Intelligence
Best Paper (DB track) Award at #NeurIPS2025 for Artificial Hivemind
Researchers from University of Washington, CMU, and Allen Institute have identified a fundamental problem in modern language models - the "Artificial Hivemind effect". HuggingFace.
Different models independently generate identical responses to open-ended questions. GPT-4, Qwen, Llama, Mixtral - all write "time is a river" when asked for a metaphor about time.
Average semantic similarity across different model families: 71-82%. This isn't a bug in one model. It's a systemic property of current LLM training paradigms.
The study covers 70+ models using the INFINITY-CHAT dataset:
- 26K real-world open-ended queries from WildChat
- 17 categories (from creative writing to philosophical questions)
- 31,250 human annotations (25 independent annotators per example)
Two forms of collapse:
• Intra-model: a single model repeats itself with pairwise similarity >0.8 in 79% of cases (even at temperature=1.0)
• Inter-model: different models produce identical phrases and structures.
Critical finding: LLM judges and reward models systematically fail when evaluating alternative responses of similar quality. Correlation with humans drops from 0.4 to 0.05 on examples with diverse content.
For business:
This creates an "AI feedback loop" - models are trained based on evaluations from other models that are themselves poorly calibrated for diversity.
Implications:
- Reduced innovation potential in AI assistants
- Standardization of creative content
- Loss of alternative perspectives in strategic analysis
- Risk of homogenizing user thinking patterns.
The future of AI should not be echoes of one voice, but a chorus of many.
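The similarity figures above (71-82% across families, >0.8 intra-model) come from averaging pairwise semantic similarity over response embeddings. A toy sketch of that measurement, with hand-made vectors standing in for a real sentence-embedding model:

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mean_pairwise_similarity(embeddings):
    """Average cosine similarity over all response pairs; values near 1.0
    indicate the homogeneity ('hivemind') collapse described in the paper."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

# Toy embeddings standing in for three models' answers to one prompt.
responses = [(1.0, 0.1, 0.0), (0.9, 0.2, 0.1), (1.0, 0.0, 0.2)]
print(round(mean_pairwise_similarity(responses), 2))
```

The same statistic works intra-model (many samples from one model) or inter-model (one sample each from many models).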
arXiv.org
Artificial Hivemind: The Open-Ended Homogeneity of Language Models...
Language models (LMs) often struggle to generate diverse, human-like creative content, raising concerns about the long-term homogenization of human thought through repeated exposure to similar...
The official announcement is pending, but Google is signing a multi-year partnership with Replit.
CNBC
Google partners with Replit, in vibe-coding push
Replit will expand usage of Google Cloud services, add more of Google's models onto its platform, and support AI coding use cases for enterprise customers.
Anthropic released Interviewer, which lets you interview people at scale using Claude.
This helps expand the kind of research you can do.
Anthropic
Introducing Anthropic Interviewer: What 1,250 professionals told us about working with AI
Is this Yann LeCun's first paper after leaving Meta? It demonstrates how humanoid robots can mimic actions from AI-generated videos, which are often too noisy for direct imitation.
The system lifts the video into 3D keypoints and then uses a physics-aware policy to execute the motions, enabling zero-shot control.
They implemented this on the Unitree G1 humanoid robot.
OpenRouter collaborated with a16z to publish the State of AI - an empirical report on how LLMs have been used on OpenRouter.
It analyzes more than 100 trillion tokens across hundreds of models and 3+ million users (excluding 3rd parties) from the last year.
A lot of insights:
1. One finding: OpenRouter observes a Cinderella "Glass Slipper" effect for new models.
Early users of a new LLM either churn quickly or become part of a foundational cohort with much higher retention than others. They are early adopters who can "lead" the rest of the market.
2. Open vs Closed Weights:
By late 2025, open-weight models (abbreviated as OSS below) reached ~1/3 of usage, sustained beyond launch spikes, but have plateaued in Q4.
3. Chinese models: grew from ~1% to around 30% in some weeks. Release velocity + quality make the market lively.
If you want a single picture of the modern stack:
- Closed models = high-value workloads
- Open models = high-volume workloads
And what we have seen is that a lot of teams use both.
OSS isn't "just for tinkering" - it is extremely popular in two areas:
β’ Roleplay / creative dialogue: >50% of OSS usage
β’ Programming assistance: ~15-20%.
4. Now the significant platform shift: agentic inference
Tracked it via:
- reasoning model adoption
- tool calling
- prompt/completion "shape" (sequence lengths).
5. Reasoning models go from "negligible" to more than 50% of tokens in 2025. Full paradigm shift.
6. Languages: English dominates with more than 80% of tokens, but the tail is real - Chinese, Russian, Spanish, etc.
7. Economics: price matters, but less than you think. On a cost-vs-usage map, the trendline is nearly flat: reducing cost by 10% only correlates with ~0.5-0.7% more usage.
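Point 7's flat trendline corresponds to a tiny implied price elasticity of demand; the arithmetic from the report's figures:

```python
# Implied price elasticity: (% change in usage) / (% change in price).
price_change_pct = -10.0          # a 10% cost reduction
usage_change_pct = (0.5, 0.7)     # correlated usage gain (report's range)

elasticities = tuple(round(u / price_change_pct, 2) for u in usage_change_pct)
print(elasticities)  # (-0.05, -0.07): demand is close to price-inelastic
```

For comparison, unit-elastic demand would be -1.0, so token demand appears driven far more by capability and workload fit than by price.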
OpenRouter
State of AI 2025: 100T Token LLM Usage Study | OpenRouter
Read OpenRouter's comprehensive 2025 State of AI report β an empirical 100 trillion token study of real LLM usage, model trends, and developer ecosystem insights.
Meta published a new paper on the path to safer superintelligence: co-improvement.
Everyone is focused on self-improving AI, but:
1) we don't know how to do it yet, and
2) it might be misaligned with humans.
Co-improvement: instead, build AI that collaborates with us to solve AI faster, and to help fix the alignment problem together.
arXiv.org
AI & Human Co-Improvement for Safer Co-Superintelligence
Self-improvement is a goal currently exciting the field of AI, but is fraught with danger, and may take time to fully achieve. We advocate that a more achievable and better goal for humanity is to...
Nvidia introduced CUDA 13.1. It is the biggest expansion of CUDA since it launched in 2006.
It introduces CUDA Tile, a new way to program GPUs that makes powerful AI and accelerated computing easier for more developers to use.
GitHub
GitHub - NVIDIA/cutile-python: cuTile is a programming model for writing parallel kernels for NVIDIA GPUs
Essential AI just dropped Essential-Web v1.0, a 24-trillion-token pre-training dataset with rich metadata, built to effortlessly curate high-performing datasets across domains and use cases. Researchers labeled 23.6B documents from Common Crawl with a 12-category…
Essential AI introduced their first open models, Rnj-1 base and instruct 8B parameter models.
Rnj-1 is the culmination of 10 months of hard work by a phenomenal team, dedicated to advancing American SOTA OSS AI.
Lots of wins with Rnj-1.
1. SWE-bench performance close to GPT-4o.
2. Tool use outperforming all comparable open source models.
3. Mathematical reasoning (AIME'25) nearly on par with GPT-OSS MoE 20B.