AI & ML Papers
32.8K subscribers
7.05K photos
519 videos
24 files
7.71K links
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
🔥 Code as Agent Harness

💡 The paper surveys the concept of code as agent harness: in LLM-based agentic systems, code serves as the operational substrate for agent reasoning and execution. The authors argue that code is no longer just a target output, but a unified infrastructure layer across multiple domains and applications. They center code as the basis for agent infrastructure and organize their survey around three connected layers: the harness interface, harness mechanisms, and scaling the harness.

The harness interface layer explores how code connects agents to reasoning, action, and environment modeling. The harness mechanisms layer examines planning, memory, and tool use for long-horizon execution, as well as feedback-driven control and optimization. The scaling layer discusses how to extend the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification.

The authors summarize representative methods and practical applications of code as agent harness, including coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. They also outline open challenges for harness engineering, such as evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments.

The paper provides a unified roadmap toward executable, verifiable, and stateful AI agent systems by centering code as the harness of agentic AI. The authors argue that this view can enable more efficient, adaptable, and reliable agent systems, and they call for further research in harness engineering to address the open challenges listed above.


📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18747
• PDF: https://arxiv.org/pdf/2605.18747

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticSystems #LargeLanguageModels #AgentReasoning #CodeAsInfrastructure #ArtificialIntelligence
🔥 TideGS: Scalable Training of Over One Billion 3D Gaussian Splatting Primitives via Out-of-Core Optimization

💡 The paper introduces TideGS, a scalable training framework for 3D Gaussian Splatting with over one billion primitives on a single GPU. The problem with training 3D Gaussian Splatting at a large scale is that it is memory-bound, with each Gaussian primitive having a large attribute vector that quickly exceeds GPU capacity. Prior systems were limited to tens of millions of Gaussians on commodity single-GPU hardware.

The authors observe that 3D Gaussian Splatting training is inherently sparse and trajectory-conditioned, meaning that each iteration only activates the Gaussians visible from the current camera batch. This insight allows the authors to manage parameters across an SSD-CPU-GPU hierarchy using three techniques: block-virtualized geometry for spatial locality, a hierarchical asynchronous pipeline to overlap I/O with computation, and trajectory-adaptive differential streaming that transfers only incremental working-set deltas between iterations.
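
The differential-streaming idea can be illustrated with plain set arithmetic. The sketch below is not the TideGS implementation; it assumes the scene is split into integer block IDs and that each camera batch yields a set of visible blocks, and the names `working_set_delta` and `stream` are hypothetical.

```python
def working_set_delta(resident: set[int], visible: set[int]) -> tuple[set[int], set[int]]:
    """Return (blocks to load, blocks to evict) when moving to a new camera batch."""
    to_load = visible - resident    # newly visible blocks that are not on the GPU yet
    to_evict = resident - visible   # blocks that are no longer needed this iteration
    return to_load, to_evict


def stream(camera_batches: list[set[int]]) -> list[int]:
    """Count how many blocks are transferred to the GPU at each iteration."""
    resident: set[int] = set()
    transfers = []
    for visible in camera_batches:
        to_load, to_evict = working_set_delta(resident, visible)
        resident = (resident - to_evict) | to_load
        transfers.append(len(to_load))
    return transfers


# Hypothetical trajectory: consecutive camera batches see overlapping blocks.
batches = [{0, 1, 2, 3}, {2, 3, 4}, {3, 4, 5}]
transfers = stream(batches)                 # [4, 1, 1]
naive = sum(len(b) for b in batches)        # 10 blocks if every batch were reloaded in full
```

On this toy trajectory only 6 block loads are needed instead of 10, because neighbouring camera batches share most of their visible blocks.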

The TideGS framework enables training with over one billion Gaussians on a single 24 GB GPU, achieving the best reconstruction quality among evaluated single-GPU baselines on large-scale scenes. This is a significant improvement over prior out-of-core baselines, which were limited to approximately 100 million Gaussians, and standard in-memory training, which was limited to approximately 11 million Gaussians. The results demonstrate that TideGS can scale beyond prior systems, making it a promising solution for large-scale 3D Gaussian Splatting applications.


📅 Published on May 19

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.20150
• PDF: https://arxiv.org/pdf/2605.20150
• Project Page: https://sponge-lab.github.io/TideGS/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#3DGaussianSplatting #ScalableDeepLearning #OutofCoreOptimization #GPUAcceleration #ComputerVisionTechniques
🔥 Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning

💡 The paper introduces Uni-Edit, a novel intelligent image editing task designed to enhance unified multimodal models' understanding, generation, and editing capabilities. Currently, these models are trained using complex multi-stage pipelines and mixed multi-task training, which can lead to performance trade-offs rather than mutual reinforcement. To address this issue, Uni-Edit proposes a single task, single training stage, and single dataset approach. The authors identify image editing as an ideal general task that naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions, which underutilize a model's understanding capacity.

To overcome this limitation, the authors develop an automated and scalable data synthesis pipeline that transforms diverse visual question answering data into complex editing instructions with embedded questions and nested logic. This pipeline yields Uni-Edit-148k, a dataset pairing diverse reasoning-intensive instructions with high-quality edited images. The authors conduct extensive experiments on two models, BAGEL and Janus-Pro, and report that tuning solely on Uni-Edit improves all three capabilities (understanding, generation, and editing) without any auxiliary operations. The results suggest that intelligent editing can serve as a single general task for tuning unified multimodal models.


📅 Published on May 20

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.21487
• PDF: https://arxiv.org/pdf/2605.21487
• Project Page: https://zhengdian1.github.io/Uni-Edit-proj/

🤖 Models citing this paper:
https://huggingface.co/Uni-Edit/Uni-Edit-BAGEL

📊 Datasets citing this paper:
https://huggingface.co/datasets/Uni-Edit/Train-Data

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#IntelligentImageEditing #UnifiedMultimodalModels #ImageEditingTasks #MultimodalModelTuning #MultitaskLearningApproaches
🔥 PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

💡 The paper introduces PEEK, a system designed to improve the performance of large language model agents operating over long and recurring external contexts, such as document corpora and code repositories. The problem with existing approaches is that they do not preserve reusable orientation knowledge about the recurring context itself, which includes information about what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful.

To address this issue, PEEK uses a context map, a small and constant-sized artifact in the agent's prompt, to cache and maintain this orientation knowledge. The context map is maintained by a programmable cache policy consisting of three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget.
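
A priority-based evictor under a fixed token budget can be sketched in a few lines. This is a minimal illustration, not PEEK's code: the `MapEntry` class, the whitespace token count, and the numeric priorities are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class MapEntry:
    """One line of orientation knowledge in a hypothetical context map."""
    text: str
    priority: float  # assumed score: higher means more useful in past runs

    @property
    def tokens(self) -> int:
        # Crude whitespace count; a real system would use the model's tokenizer.
        return len(self.text.split())


def evict(entries: list[MapEntry], budget: int) -> list[MapEntry]:
    """Drop the lowest-priority entries until the map fits within the token budget."""
    kept = sorted(entries, key=lambda e: e.priority, reverse=True)
    while kept and sum(e.tokens for e in kept) > budget:
        kept.pop()  # the list is sorted, so the last entry has the lowest priority
    return kept


entries = [
    MapEntry("config lives in settings.yaml", 0.5),
    MapEntry("README mentions a deprecated v1 API", 0.1),
    MapEntry("orders table keyed by order_id", 0.9),
]
kept = evict(entries, budget=10)  # keeps the 0.9 and 0.5 entries (9 tokens in total)
```

Keeping the map at a constant size means the prompt overhead stays fixed no matter how many runs have contributed knowledge to it.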

The results show that PEEK improves over strong baselines in long-context reasoning and information aggregation tasks by 6.3-34.0 percent, while using 93-145 fewer iterations and incurring 1.7-5.8 times lower cost than the state-of-the-art prompt-learning framework, ACE. Additionally, PEEK improves solving rate and rubric accuracy in context learning tasks by 6.0-14.0 percent and 7.8-12.1 percent, respectively, at 1.4 times lower cost than ACE. These gains generalize across different language models and agent architectures, including OpenAI Codex, a production-grade coding agent. Overall, the paper demonstrates that using a context map helps long-context language model agents interact with recurring external contexts more accurately and efficiently.


📅 Published on May 19

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.19932
• PDF: https://arxiv.org/pdf/2605.19932
• Project Page: https://zhuohangu.github.io/blog-post-peek/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LongContextLLM #ContextMap #OrientationCache #LargeLanguageModelAgents #RecurringContextProcessing
🔥 OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

💡 The paper introduces OScaR, a novel framework for compressing Key-Value caches in large language models, which is a major memory bottleneck for efficient deployment. The existing per-channel quantization method is limited by Token Norm Imbalance, where errors are amplified when quantization parameters are shared across tokens with different norms. To address this, OScaR uses Canalized Rotation and Omni-Token Scaling to reduce the impact of Token Norm Imbalance, resulting in a more accurate and efficient compression framework.

The method works by first applying Canalized Rotation to mitigate the sequence-dimensional variance caused by Token Norm Imbalance, and then applying Omni-Token Scaling to further reduce the errors. This approach is supported by an optimized system design and CUDA kernels, making it a lightweight and efficient solution.
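
The Token Norm Imbalance problem itself is easy to reproduce on a toy example. The sketch below is not OScaR's Canalized Rotation or Omni-Token Scaling; it uses a generic per-token max-abs normalization as a stand-in, with a three-level symmetric quantizer, to show why sharing per-channel parameters across tokens of very different norms loses the small tokens. All function names are hypothetical.

```python
def quantize_per_channel(rows, qmax=1):
    """Symmetric quantize/dequantize with one scale per channel, shared by all tokens."""
    n_ch = len(rows[0])
    scales = [(max(abs(r[c]) for r in rows) / qmax) or 1.0 for c in range(n_ch)]
    return [
        [max(-qmax, min(qmax, round(r[c] / scales[c]))) * scales[c] for c in range(n_ch)]
        for r in rows
    ]


def quantize_with_token_scaling(rows, qmax=1):
    """Normalize each token by its max-abs value first, then quantize per channel."""
    token_scales = [max(abs(v) for v in r) or 1.0 for r in rows]
    normalized = [[v / s for v in r] for r, s in zip(rows, token_scales)]
    dequantized = quantize_per_channel(normalized, qmax)
    # Token scales are assumed to be stored in higher precision alongside the cache.
    return [[v * s for v in r] for r, s in zip(dequantized, token_scales)]


def total_abs_error(a, b):
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


# Two tokens with the same direction but a 16x difference in norm.
keys = [[8.0, -4.0], [0.5, -0.25]]
plain_err = total_abs_error(keys, quantize_per_channel(keys))          # 0.75
scaled_err = total_abs_error(keys, quantize_with_token_scaling(keys))  # 0.0
```

Without the per-token step, the large token sets the channel scales and the small token is rounded to zero. The price of the fix is one extra scale per token.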

The paper evaluates OScaR on text-only, multi-modal, and omni-modal large language models and reports that it consistently outperforms existing methods, achieving near-lossless performance under INT2 quantization. Compared to the baseline, OScaR achieves a 3.0x speedup in decoding, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The authors describe OScaR as a robust, low-complexity, and universal framework for KV cache compression, and its code is publicly available.


📅 Published on May 19

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.19660
• PDF: https://arxiv.org/pdf/2605.19660
• Project Page: https://iridescent-gcrace.github.io/OScaR/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LLMCompression #KeyValueCacheQuantization #ExtremeQuantizationTechniques #TokenNormImbalance #EfficientLLMDeployment
🔥 Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

💡 The paper addresses the problem of robust speech recognition in real-world environments, where current models often struggle with acoustic distortions, producing omissions or hallucinations. This issue is referred to as the acoustic robustness bottleneck. To overcome this, the authors propose the Mega-ASR framework, which combines compound-data construction with progressive acoustic-to-semantic optimization techniques.

The Mega-ASR framework uses a new dataset called Voices-in-the-Wild-2M, which covers 7 classic acoustic phenomena and 54 physically plausible compound scenarios. The authors train Mega-ASR using two techniques: Acoustic-to-Semantic Progressive Supervised Fine-Tuning and Dual-Granularity WER-Gated Policy Optimization.

The results show that Mega-ASR significantly outperforms prior state-of-the-art systems on adverse-condition ASR benchmarks, with a word error rate of 45.69 percent on the VOiCES R4-B-F benchmark and 21.49 percent on the NOIZEUS Sta-0 benchmark. Additionally, Mega-ASR achieves over 30 percent relative word error rate reduction on complex compositional acoustic scenarios compared to strong open- and closed-source baselines.
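
For readers unfamiliar with the metric, word error rate is the word-level edit distance between hypothesis and reference divided by the reference length, and a relative reduction compares two such rates. The sketch below uses the standard definition; the example sentences and the 0.40 and 0.28 rates are made up for illustration and are not from the paper.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # edit distances against an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer: float, system_wer: float) -> float:
    """Fraction of the baseline's errors that the new system removes."""
    return (baseline_wer - system_wer) / baseline_wer


example = wer("turn the lights off", "turn lights of")  # 2 edits over 4 words
reduction = relative_reduction(0.40, 0.28)              # about 0.30, i.e. 30 percent
```

Note that a relative reduction of 30 percent says nothing about the absolute rate, which is why absolute figures on hard benchmarks can still look high.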

Overall, the paper presents a scalable paradigm for robust speech recognition in real-world environments, addressing the acoustic robustness bottleneck and achieving significant improvements over prior systems. The Mega-ASR framework has the potential to improve speech recognition in a wide range of applications, from voice assistants to transcription services.


📅 Published on May 19

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.19833
• PDF: https://arxiv.org/pdf/2605.19833
• Project Page: https://xzf-thu.github.io/Mega-ASR/

🤖 Models citing this paper:
https://huggingface.co/zhifeixie/Mega-ASR

🚀 Spaces citing this paper:
https://huggingface.co/spaces/Reza2kn/mega-asr-bench

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#SpeechRecognitionTechniques #AcousticRobustnessInASR #RealWorldSpeechProcessing #AcousticSimulationMethods #RobustASRSystems
🔥 Stable Audio 3

💡 The paper introduces Stable Audio 3, a family of fast latent diffusion models for variable-length audio generation and editing. The problem addressed is the inefficiency of generating full-length audio for short sounds, which can be costly. To solve this, the authors propose a method that uses latent diffusion models operating on a novel semantic-acoustic autoencoder, which projects audio into a compact latent space. This enables efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent space. The models also support inpainting, allowing for targeted audio editing and continuation of short recordings.

The method involves training the latent diffusion models on a dataset of licensed and Creative Commons data, and then running adversarial post-training to accelerate inference and improve generation quality. This reduces the number of inference steps while improving fidelity and prompt adherence.

The results show that the Stable Audio 3 models can generate music and sounds in less than 2 seconds on an H200 GPU and within a few seconds on a MacBook Pro M4. The authors release the weights of the small and medium models, which can run on consumer-grade hardware, along with their training and inference pipeline. Variable-length generation makes it possible to produce several minutes of audio while avoiding the cost of full-length generations for short sounds, with potential applications in music and sound design.


📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.17991
• PDF: https://arxiv.org/pdf/2605.17991
• Project Page: https://stability.ai/news-updates/meet-stable-audio-3-the-model-family-built-for-artistic-experimentation-with-open-weight-models

🤖 Models citing this paper:
https://huggingface.co/stabilityai/stable-audio-3-medium
https://huggingface.co/stabilityai/stable-audio-3-small-music
https://huggingface.co/stabilityai/stable-audio-3-small-sfx

🚀 Spaces citing this paper:
https://huggingface.co/spaces/stabilityai/stable-audio-3
https://huggingface.co/spaces/owenisas/stable-audio-3-lab

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LatentDiffusionModels #AudioGeneration #VariableLengthAudio #SemanticAcousticAutoencoder #DiffusionBasedAudioEditing
🔥 HRM-Text: Efficient Pretraining Beyond Scaling

💡 The current approach to training large language models requires massive computational power and large amounts of raw text, creating a significant barrier to research. Inspired by the efficient learning processes of biological systems, the authors propose a new approach called HRM-Text, which uses a Hierarchical Recurrent Model architecture. This architecture decouples computation into two layers, a slow-evolving strategic layer and a fast-evolving execution layer, allowing for more efficient processing. To stabilize this model, the authors introduce two new techniques, MagicNorm and warmup deep credit assignment.

Instead of training on raw text, HRM-Text is trained exclusively on instruction-response pairs using a task-completion objective. The model is also trained with PrefixLM masking, which helps to improve its performance. The results show that a 1 billion parameter HRM-Text model, trained from scratch on only 40 billion unique tokens and with a budget of 1500 dollars, achieves competitive performance on several benchmarks, including MMLU, ARC-C, DROP, GSM8K, and MATH.
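
PrefixLM masking is a standard technique: positions inside the prefix attend to each other bidirectionally, while the remaining positions attend causally. The sketch below builds such a mask as a boolean matrix. Treating the instruction as the prefix and the response as the causal part is an assumption here, and `prefix_lm_mask` is a hypothetical helper rather than the authors' code.

```python
def prefix_lm_mask(prefix_len: int, total_len: int) -> list[list[bool]]:
    """Build a mask where mask[i][j] is True when position i may attend to position j."""
    return [
        # Every position sees the whole prefix; beyond it, only itself and earlier positions.
        [j < prefix_len or j <= i for j in range(total_len)]
        for i in range(total_len)
    ]


# Two instruction tokens (assumed prefix) followed by two response tokens.
mask = prefix_lm_mask(prefix_len=2, total_len=4)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

With a prefix of 2 and a total length of 4, the first two rows allow attention to both prefix positions, and the last two rows extend causally over the response.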

Notably, HRM-Text achieves this performance while utilizing significantly fewer training tokens and less estimated compute than standard baselines. Specifically, it uses 100-900 times fewer training tokens and 96-432 times less estimated compute. This demonstrates that co-designing architectures and objectives can radically reduce the compute-to-performance ratio, making it possible to train large language models from scratch with limited resources. The authors' approach makes pretraining more accessible to the broader research community, which could lead to further advancements in the field of natural language processing.


📅 Published on May 20

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.20613
• PDF: https://arxiv.org/pdf/2605.20613
• Project Page: https://github.com/sapientinc/HRM-Text

🤖 Models citing this paper:
https://huggingface.co/sapientinc/HRM-Text-1B

📊 Datasets citing this paper:
https://huggingface.co/datasets/sapientinc/HRM-Text-data-io-cleaned-20260515

🚀 Spaces citing this paper:
https://huggingface.co/spaces/nikravan/HRM-Text-1B
https://huggingface.co/spaces/Bhaddy392/GPT_AI
https://huggingface.co/spaces/bunnycore/HRM-Text-1B

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#HierarchicalRecurrentModels #EfficientPretrainingMethods #LargeLanguageModelOptimization #InstructionResponsePairLearning #NeuralArchitectureInnovation
🔥 MemOS: A Memory OS for AI System

💡 The paper introduces MemOS, a memory operating system designed for Large Language Models to address the challenges of memory management. Current models lack a well-defined memory management system, relying on static parameters and short-lived contextual states, which limits their ability to track user preferences or update knowledge over time. The proposed MemOS system unifies plaintext, activation-based, and parameter-level memories, enabling efficient storage, retrieval, and continual learning.

The key contribution of MemOS is the introduction of a basic unit called a MemCube, which encapsulates both memory content and metadata such as provenance and versioning. MemCubes can be composed, migrated, and fused over time, allowing for flexible transitions between memory types and bridging retrieval with parameter-based learning.
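
One way to picture a MemCube is as an immutable record that carries its content together with provenance and a version counter. The sketch below is purely illustrative and is not the MemOS API: the field names, the `fuse` and `migrate` helpers, and the versioning rule are all assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MemCube:
    """Hypothetical memory unit: content plus metadata, loosely following the summary."""
    content: str
    memory_type: str             # assumed labels: "plaintext", "activation", "parameter"
    provenance: tuple[str, ...]  # where the content came from
    version: int = 1


def fuse(a: MemCube, b: MemCube) -> MemCube:
    """Combine two cubes, keeping both provenance trails and bumping the version."""
    return MemCube(
        content=a.content + "\n" + b.content,
        memory_type=a.memory_type,
        provenance=a.provenance + b.provenance,
        version=max(a.version, b.version) + 1,
    )


def migrate(cube: MemCube, new_type: str) -> MemCube:
    """Move a cube to another memory type, recording the change as a new version."""
    return replace(cube, memory_type=new_type, version=cube.version + 1)


prefs = MemCube("user prefers metric units", "plaintext", ("chat:2024-01",))
facts = MemCube("user lives in Oslo", "plaintext", ("chat:2024-02",))
fused = fuse(prefs, facts)              # version 2, two provenance entries
moved = migrate(fused, "parameter")     # version 3, same content
```

Keeping cubes immutable means every fusion or migration yields a new version, so earlier states remain available for auditing.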

By treating memory as a manageable system resource, MemOS establishes a memory-centric system framework that brings controllability, plasticity, and evolvability to Large Language Models. This framework enables cost-efficient storage and retrieval, laying the foundation for continual learning and personalized modeling. The proposed system has the potential to address the broader challenges of managing heterogeneous knowledge spanning different temporal scales and sources, and can substantially reduce the training and inference costs of Large Language Models.

Overall, the paper proposes a novel approach to memory management for Large Language Models that is intended to improve their ability to learn and adapt over time, and that the authors present as a step toward more advanced Artificial General Intelligence systems.


📅 Published on Jul 4, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2507.03724
• PDF: https://arxiv.org/pdf/2507.03724
• Project Page: https://memos.openmem.net/

🤖 Models citing this paper:
https://huggingface.co/kagvi13/HMP

📊 Datasets citing this paper:
https://huggingface.co/datasets/MemTensor/MemOS_eval_result

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MemoryOperatingSystem #LargeLanguageModels #MemoryManagementSystems #ContinualLearningAlgorithms #ArtificialIntelligenceArchitecture
🔥 AutoResearchClaw: Self-Reinforcing Autonomous Research with Human-AI Collaboration

💡 AutoResearchClaw is a new autonomous research system that improves scientific discovery by incorporating human collaboration and iterative learning. The problem with existing autonomous research systems is that they often model the research process as a linear pipeline, relying on single-agent reasoning and stopping when execution fails, without carrying experience across runs.

The authors of AutoResearchClaw address this issue by introducing a multi-agent autonomous research pipeline built on five key mechanisms. The first mechanism is structured multi-agent debate for hypothesis generation and result analysis, which allows multiple perspectives to be considered. The second mechanism is a self-healing executor with a pivot-refine decision loop that transforms failures into information, enabling the system to learn from its mistakes. The third mechanism is verifiable result reporting that prevents fabricated numbers and hallucinated citations. The fourth mechanism is human-in-the-loop collaboration with seven intervention modes, allowing for varying levels of human oversight. The fifth mechanism is cross-run evolution that converts past mistakes into future safeguards, enabling the system to improve over time.

AutoResearchClaw outperforms a previous system, AI Scientist v2, by 54.7 percent on a 25-topic experiment-stage benchmark. The authors also conducted a human-in-the-loop ablation study, which revealed that precise, targeted collaboration at high-leverage decision points consistently outperforms both full autonomy and exhaustive step-by-step oversight. Overall, AutoResearchClaw is positioned as a research amplifier that augments rather than replaces human scientific judgment, and its code is available for further development and use.


📅 Published on May 19

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.20025
• PDF: https://arxiv.org/pdf/2605.20025
• Project Page: https://github.com/aiming-lab/AutoResearchClaw

📊 Datasets citing this paper:
https://huggingface.co/datasets/AIMING-Lab-UNC/ARC-Bench

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AutonomousResearchSystems #HumanAICollaboration #MultiAgentLearning #ArtificialIntelligenceInScience #SelfReinforcingSystems