AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
🔥 InterleaveThinker: Reinforcing Agentic Interleaved Generation

💡 The paper introduces InterleaveThinker, a multi-agent pipeline that enables existing image generators to perform interleaved generation, that is, producing a sequence of mixed text and images. Current state-of-the-art image generators handle this poorly because of their architectures. InterleaveThinker addresses the limitation with a planner agent that organizes the input sequence and instructs the image generator, and a critic agent that evaluates the generator's outputs and refines the instructions. Training relies on Interleave-Planner-SFT-80k and Interleave-Critic-SFT-112k for a format cold-start, and on Interleave-Critic-RL-13k for reinforcement learning that corrects instructions within a generation trajectory.

The method involves using a planner agent to plan the execution of the image generator at each step, and a critic agent to evaluate the generator's outputs and identify samples that deviate from the planned instructions. The critic agent then refines the instructions for regeneration. To optimize the entire generation trajectory, the authors propose using accuracy reward and step-wise reward, which allows single-step reinforcement learning to guide the entire trajectory.
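The plan, generate, critique, and regenerate cycle described above can be sketched as a small control loop. This is only an illustration under stated assumptions: `run_interleaved`, the `max_retries` cap, and the toy planner, generator, and critic are hypothetical stand-ins, not the authors' models or code.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Step:
    instruction: str  # final instruction sent to the generator
    image: str        # placeholder for the generated image
    attempts: int     # how many generations this step needed


def run_interleaved(
    requests: List[str],
    plan: Callable[[str], str],
    generate: Callable[[str], str],
    critique: Callable[[str, str], Tuple[bool, str]],
    max_retries: int = 2,  # hypothetical cap on regenerations per step
) -> List[Step]:
    """Plan each step, generate, and let the critic refine the instruction on failure."""
    trajectory = []
    for request in requests:
        instruction = plan(request)
        attempts = 0
        while True:
            attempts += 1
            image = generate(instruction)
            accepted, refined = critique(instruction, image)
            # Stop on success, or after the first try plus max_retries regenerations.
            if accepted or attempts > max_retries:
                break
            instruction = refined
        trajectory.append(Step(instruction, image, attempts))
    return trajectory


# Toy stand-ins (hypothetical): the "generator" only succeeds when a style is named.
def toy_plan(request: str) -> str:
    return f"draw {request}"


def toy_generate(instruction: str) -> str:
    return f"<image of: {instruction}>" if "style" in instruction else "<blurry image>"


def toy_critique(instruction: str, image: str) -> Tuple[bool, str]:
    if image == "<blurry image>":
        return False, instruction + " in watercolor style"
    return True, instruction


trajectory = run_interleaved(["a cat", "a dog"], toy_plan, toy_generate, toy_critique)
```

In this toy run each step is rejected once and accepted after the critic's refinement, so every step records two attempts.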

The results show that InterleaveThinker improves various image generators on interleaved generation benchmarks, reaching performance comparable to state-of-the-art models such as Nano Banana and GPT-5. It also significantly enhances base models such as the 4-step FLUX.2-klein on reasoning-based benchmarks, with substantial gains on WISE and RISE.


📅 Published on Jun 11

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.13679
• PDF: https://arxiv.org/pdf/2606.13679
• Project Page: https://zhengdian1.github.io/InterleaveThinker-proj/

🤖 Models citing this paper:
https://huggingface.co/InterleaveThinker/InterleaveThinker-Planner-8B
https://huggingface.co/InterleaveThinker/InterleaveThinker-Critic-8B
https://huggingface.co/InterleaveThinker/Critic-SFT-8B

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#InterleavedGeneration #AgenticInterleaving #MultiAgentPipelines #ImageTextGeneration #ReinforcedGeneration
🔥 Robust-U1: Can MLLMs Self-Recover Corrupted Visual Content for Robust Understanding?

💡 The paper proposes a novel framework called Robust-U1 to enhance the robustness of multimodal large language models against visual corruptions. The problem addressed is that existing models perform poorly when faced with real-world visual corruptions such as noise or blur. Current approaches to improve robustness have limitations, either lacking interpretability or being unable to restore lost pixel-level details.

The Robust-U1 framework is designed to equip models with an explicit visual self-recovery capability, allowing them to recover corrupted visual content by themselves. The approach consists of three stages: supervised fine-tuning for initial reconstruction, reinforcement learning with dual rewards to align the recovered images with high visual quality, and multimodal reasoning that considers both the corrupted input and the recovered image.

The results show that Robust-U1 achieves state-of-the-art robustness on a real-world corruption benchmark and maintains superior performance under adversarial corruptions on general visual question answering benchmarks. The analysis confirms that high-quality visual recovery directly enhances reasoning performance, establishing self-recovery as a critical mechanism for robust visual understanding. Overall, the paper demonstrates that multimodal large language models can self-recover corrupted visual content, leading to improved robustness and performance in visual understanding tasks.


📅 Published on Jun 6

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.08063
• PDF: https://arxiv.org/pdf/2606.08063
• Project Page: https://huggingface.co/spaces/Jiaqi-hkust/Robust-U1

🤖 Models citing this paper:
https://huggingface.co/Jiaqi-hkust/Robust-U1-SFT
https://huggingface.co/Jiaqi-hkust/Robust-U1-RL
https://huggingface.co/Jiaqi-hkust/Robust-U1

🚀 Spaces citing this paper:
https://huggingface.co/spaces/Jiaqi-hkust/Robust-U1

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalLearning #VisualContentRecovery #RobustLanguageModels #SelfRecoveryMechanisms #CorruptionResistantAI
🔥 MiniMax Sparse Attention

💡 The paper introduces MiniMax Sparse Attention, a method for efficient processing of ultra-long contexts in large language models. The problem addressed is that the quadratic cost of softmax attention makes it difficult to jointly attend over hundreds of thousands to millions of tokens, which is necessary for applications such as agentic workflows, repository-scale code reasoning, and persistent memory.

To address this problem, the authors propose a blockwise sparse attention built upon Grouped Query Attention, called MiniMax Sparse Attention. This method consists of two branches: a lightweight Index Branch that scores key-value blocks and selects a Top-k subset for each group, and a Main Branch that performs exact block-sparse attention over only the selected blocks.
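The two-branch idea can be illustrated for a single query in plain Python. This is a sketch under assumptions, not the paper's kernel: scoring a block by the dot product between the query and the block's mean key is a stand-in for the Index Branch, whose actual scoring rule the summary does not give, and the per-group selection used with Grouped Query Attention is reduced here to one query vector.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def select_blocks(query: Vector, keys: List[Vector], block_size: int, top_k: int) -> List[int]:
    """Index branch (assumed rule): rank key blocks by query . mean(block keys)."""
    scored = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(column) / len(block) for column in zip(*block)]
        scored.append((dot(query, mean_key), start // block_size))
    best = sorted(scored, reverse=True)[:top_k]
    return sorted(index for _, index in best)


def block_sparse_attention(
    query: Vector, keys: List[Vector], values: List[Vector], block_size: int, top_k: int
) -> Tuple[Vector, List[int]]:
    """Main branch: exact softmax attention restricted to the selected blocks."""
    chosen = select_blocks(query, keys, block_size, top_k)
    token_ids = [
        t
        for block in chosen
        for t in range(block * block_size, min((block + 1) * block_size, len(keys)))
    ]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[t]) * scale for t in token_ids]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    output = [
        sum(w * values[t][d] for w, t in zip(weights, token_ids)) / total
        for d in range(len(values[0]))
    ]
    return output, chosen


# Eight tokens in four blocks of two; blocks 1 and 3 align best with the query.
keys = [[0.0, 1.0], [0.0, 1.0], [5.0, 0.0], [5.0, 0.0],
        [-1.0, 0.0], [-1.0, 0.0], [4.0, 0.0], [4.0, 0.0]]
values = [[float(t)] for t in range(8)]
output, chosen = block_sparse_attention([1.0, 0.0], keys, values, block_size=2, top_k=2)
# chosen == [1, 3]: only tokens 2, 3, 6 and 7 take part in the softmax.
```

With `top_k` equal to the number of blocks the sketch reduces to ordinary dense attention, which is a convenient sanity check for the selection logic.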

The method is designed to be simple and scalable, making it easy to deploy efficiently across a range of GPUs. The authors also co-design a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access.

The results show that MiniMax Sparse Attention performs on par with Grouped Query Attention while reducing per-token attention compute by 28.4x at 1M context. When paired with the co-designed kernel, it achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. The authors also release a production-grade natively multimodal model powered by MiniMax Sparse Attention, as well as the inference kernel, making it available for use by others.

Overall, the paper contributes a method for efficiently processing ultra-long contexts in large language models, with direct relevance to applications that require joint attention over very large numbers of tokens.


📅 Published on Jun 11

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.13392
• PDF: https://arxiv.org/pdf/2606.13392

🤖 Models citing this paper:
https://huggingface.co/MiniMaxAI/MiniMax-M3
https://huggingface.co/MiniMaxAI/MiniMax-M3-MXFP8
https://huggingface.co/sparkarena/Minimax-M3-v0-NVFP4

🚀 Spaces citing this paper:
https://huggingface.co/spaces/saivivek6/updated_mongodb_p
https://huggingface.co/spaces/akhaliq/MiniMax-M3

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MiniMaxSparseAttention #EfficientLanguageModeling #SparseAttentionMechanisms #UltraLongContextProcessing #BlockwiseAttentionMethods
🔥 SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

💡 The paper introduces SpatialClaw, a training-free framework that enables flexible and stateful spatial reasoning in vision-language models. The problem addressed is the limitation of current spatial agents in performing open-ended spatial reasoning tasks, which is due to the design of the action interface that invokes specialist perception modules. Existing spatial agents use either single-pass code execution or a structured tool-call interface, both of which offer limited flexibility for complex 3D/4D spatial reasoning.

The proposed SpatialClaw framework uses code as the action interface, allowing a vision-language model-backed agent to write executable code conditioned on prior outputs. This approach enables the agent to flexibly compose and manipulate perception results and adapt its analysis to intermediate text and visual observations. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives.
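What "stateful" buys the agent can be shown with a minimal kernel whose namespace persists between code snippets. This is a sketch only: `StatefulKernel`, `detect_objects`, and `distance` are hypothetical names invented for illustration, not SpatialClaw's actual primitives, and a real system would run untrusted model-written code in a sandbox rather than through a bare `exec`.

```python
import contextlib
import io


class StatefulKernel:
    """Runs code snippets in one shared namespace so later steps can reuse earlier results."""

    def __init__(self, preloaded: dict):
        self.namespace = dict(preloaded)

    def run(self, code: str) -> str:
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)  # illustration only; not safe for untrusted code
        return buffer.getvalue().strip()


def detect_objects(frame: dict) -> dict:
    """Hypothetical perception primitive: returns object name -> 3D position."""
    return frame["objects"]


def distance(a, b) -> float:
    """Hypothetical geometry primitive: Euclidean distance between two 3D points."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


frames = [{"objects": {"chair": (0.0, 0.0, 0.0), "table": (3.0, 4.0, 0.0)}}]
kernel = StatefulKernel({"frames": frames, "detect_objects": detect_objects, "distance": distance})

# Step 1: the agent inspects the scene and stores the result in the kernel.
first_observation = kernel.run("objects = detect_objects(frames[0])\nprint(sorted(objects))")
# Step 2: a later snippet, written after seeing step 1's output, reuses `objects`.
second_observation = kernel.run("print(distance(objects['chair'], objects['table']))")
```

The second snippet never recomputes the detections; it builds on the variable left behind by the first, which is the flexibility a single-pass execution interface lacks.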

The results show that SpatialClaw achieves superior performance across diverse 3D/4D spatial reasoning tasks, with an average accuracy of 59.9% across 20 benchmarks. This is an improvement of 11.2 points over a recent spatial agent baseline, with consistent gains across six vision-language model backbones from two model families and no benchmark- or model-specific adaptation. The paper's contribution is a flexible, training-free framework for spatial reasoning that applies to a wide range of tasks.


📅 Published on Jun 11

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.13673
• PDF: https://arxiv.org/pdf/2606.13673
• Project Page: https://spatialclaw.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#SpatialReasoning #VisionLanguageModels #AgenticInterfaces #SpatialArtificialIntelligence #CodeBasedActionInterfaces
🔥 EurekAgent: Agent Environment Engineering is All You Need For Autonomous Scientific Discovery

💡 The paper introduces EurekAgent, a system designed to enhance autonomous scientific discovery through environment engineering. The authors argue that as model capabilities improve, the main bottleneck for autonomous scientific discovery shifts from designing agent workflows to designing agent environments. Environment engineering means building environments that promote productive behaviors such as exploration, collaboration, and artifact management, while suppressing harmful behaviors such as reward hacking and reducing the friction of human oversight.

The EurekAgent system engineers the environment along four dimensions: permissions engineering, artifact engineering, budget engineering, and human-in-the-loop engineering. Permissions engineering allows for bounded agent execution and isolated evaluation, while artifact engineering enables filesystem and Git-based collaboration. Budget engineering enables budget-aware exploration, and human-in-the-loop engineering facilitates easy human supervision and intervention.
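Of the four dimensions, budget engineering is the easiest to illustrate in a few lines. The sketch below assumes a hard spending cap enforced by the environment and a cheapest-first exploration policy; both are illustrative choices, and `Budget`, `BudgetExceeded`, and `explore` are hypothetical names rather than parts of EurekAgent.

```python
from typing import List, Tuple


class BudgetExceeded(Exception):
    """Raised when a charge would push spending past the cap."""


class Budget:
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def remaining(self) -> float:
        return self.limit_usd - self.spent_usd

    def charge(self, cost_usd: float) -> None:
        # The environment, not the agent, decides whether a call may proceed.
        if cost_usd > self.remaining():
            raise BudgetExceeded(f"need {cost_usd}, have {self.remaining()}")
        self.spent_usd += cost_usd


def explore(candidates: List[Tuple[str, float]], budget: Budget) -> List[str]:
    """Run candidate experiments cheapest first until the next one no longer fits."""
    completed = []
    for name, cost in sorted(candidates, key=lambda item: item[1]):
        try:
            budget.charge(cost)
        except BudgetExceeded:
            break
        completed.append(name)
    return completed


budget = Budget(limit_usd=11.0)
completed = explore([("wide search", 6.0), ("local tweak", 4.0), ("restart", 5.0)], budget)
```

Here the two cheaper experiments run and the third is refused, leaving 2.0 of the cap unspent.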

The authors demonstrate the effectiveness of EurekAgent by achieving state-of-the-art results across multiple domains, including mathematics, kernel engineering, and machine learning tasks. Notably, EurekAgent discovered new state-of-the-art 26-circle packing results with less than 11 dollars in total API cost. The system's low computational costs and impressive results highlight the potential of environment engineering for autonomous scientific discovery.

The paper's main contribution is the introduction of environment engineering as a core research direction for developing reliable autonomous research agents. By open-sourcing their code and results, the authors invite the research community to explore and build upon their work, with the goal of advancing autonomous scientific discovery. Overall, the paper presents a significant step forward in the development of autonomous scientific discovery systems, and highlights the importance of environment engineering in achieving reliable and efficient autonomous research.


📅 Published on Jun 11

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.13662
• PDF: https://arxiv.org/pdf/2606.13662

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AutonomousScientificDiscovery #AgentEnvironmentEngineering #EnvironmentEngineering #ArtificialIntelligenceForScience #AutonomousResearchSystems
🔥 RepWAM: World Action Modeling with Representation Visual-Action Tokenizers

💡 The paper introduces RepWAM, a representation-centric world action model that improves robot manipulation performance through language-guided future state prediction and action modeling. The problem with existing world action models is that they use reconstruction-oriented video tokenizers that prioritize visual fidelity over instruction-following dynamics, limiting their ability to connect future prediction with robot control.

To address this, the authors propose a semantic visual-action latent space that maps visual inputs into aligned visual and latent action tokens. They train a representation visual-action tokenizer and pretrain their world action model to jointly model future visual states and latent actions under language instructions. The model is then adapted to real robot trajectories for closed-loop manipulation.

The results show that RepWAM delivers strong performance across diverse manipulation settings, outperforming reconstruction-oriented alternatives. The authors highlight the value of semantic visual-action tokenization as a promising foundation for world action models and a step toward generalist robot policies. The code and weights for RepWAM will be made available, allowing for further development and application of this technology. Overall, the paper contributes a new approach to world action modeling that prioritizes instruction-following dynamics and semantic understanding, leading to improved robot manipulation performance.


📅 Published on Jun 11

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.13674
• PDF: https://arxiv.org/pdf/2606.13674
• Project Page: https://wdrink.github.io/RepWAM/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#RobotManipulation #WorldActionModeling #VisualActionTokenizers #LanguageGuidedControl #FutureStatePrediction
🔥 LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

💡 The paper presents LMCache, an efficient key-value cache layer for large language model inference at the enterprise scale. The problem addressed is the traditional storage of key-value caches in GPU memory, which limits cache reuse across different queries and inference engines. As the total key-value cache stored by users grows rapidly, exceeding the capacity of GPU memory, there is a need to move caches outside GPU devices.

The authors propose LMCache as a solution, which extracts and stores key-value caches generated by modern large language model engines out of the GPU memory and shares them across engines and queries. LMCache supports cache offloading and prefill-decode disaggregation, allowing for cross-engine and GPU cache transfer. The key contributions of LMCache include highly optimized key-value cache data movement, a modular cache connector component that decouples LMCache from the evolution of inference engines, and a control API for flexible cache orchestration across different layers.
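The core idea of reusing KV caches outside GPU memory can be sketched as a prefix-keyed store with an offload tier. Everything here is an assumption made for illustration: the chunk size, the hashing scheme, and the names `chunk_keys`, `TieredKVStore`, and `reusable_prefix` are hypothetical and do not describe LMCache's real API or data layout.

```python
import hashlib
from collections import OrderedDict
from typing import Dict, List, Optional

CHUNK = 4  # tokens per cached chunk; illustrative value only


def chunk_keys(tokens: List[int]) -> List[str]:
    """One key per full chunk, hashing the chunk plus everything before it."""
    keys, running = [], hashlib.sha256()
    for start in range(0, len(tokens) - len(tokens) % CHUNK, CHUNK):
        running.update(repr(tokens[start:start + CHUNK]).encode())
        keys.append(running.hexdigest())
    return keys


class TieredKVStore:
    """A small 'GPU' tier with least-recently-used eviction into an offload tier."""

    def __init__(self, gpu_capacity: int):
        self.gpu: "OrderedDict[str, str]" = OrderedDict()
        self.offloaded: Dict[str, str] = {}
        self.gpu_capacity = gpu_capacity

    def put(self, key: str, kv: str) -> None:
        self.gpu[key] = kv
        self.gpu.move_to_end(key)
        while len(self.gpu) > self.gpu_capacity:
            old_key, old_kv = self.gpu.popitem(last=False)
            self.offloaded[old_key] = old_kv  # offload instead of discarding

    def get(self, key: str) -> Optional[str]:
        if key in self.gpu:
            self.gpu.move_to_end(key)
            return self.gpu[key]
        return self.offloaded.get(key)


def reusable_prefix(store: TieredKVStore, tokens: List[int]) -> int:
    """Number of leading tokens whose KV cache can be fetched instead of recomputed."""
    hits = 0
    for key in chunk_keys(tokens):
        if store.get(key) is None:
            break
        hits += CHUNK
    return hits


store = TieredKVStore(gpu_capacity=1)
first_request = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for index, key in enumerate(chunk_keys(first_request)):
    store.put(key, f"kv-chunk-{index}")  # placeholder for real KV tensors

# A second query (possibly from another engine) shares the first eight tokens.
second_request = [1, 2, 3, 4, 5, 6, 7, 8, 99, 100, 101, 102]
reused_tokens = reusable_prefix(store, second_request)
```

Because each key covers the whole prefix up to its chunk, a lookup stops at the first miss, and the evicted chunk is still served from the offload tier rather than recomputed.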

The evaluation of LMCache shows significant improvements in throughput, with up to 15 times improvement when combined with a large language model engine. The adoption of LMCache in enterprise settings provides valuable insights, such as the benefits of fetching key-value caches from remote storage and the impact of context truncation on prefix cache hit ratio. Overall, LMCache is presented as an efficient and open-source key-value caching solution that addresses the need for efficient cache management in large language model inference.


📅 Published on Oct 8, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2510.09665
• PDF: https://arxiv.org/pdf/2510.09665
• Project Page: https://huggingface.co/collections/dvps/dvps-scientific-watch

🤖 Models citing this paper:
https://huggingface.co/enfinity7B/apac

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LargeLanguageModels #LLMInference #KVCacheOptimization #EnterpriseScaleAI #GPUAcceleratedInference
🔥 Foundations of Large Language Models

💡 The book Foundations of Large Language Models provides a comprehensive overview of the fundamental concepts underlying large language models. The book is structured into four main chapters, each focusing on a key area: pre-training, generative models, prompting techniques, and alignment methods. The authors aim to provide a foundational understanding of large language models, rather than a comprehensive coverage of all cutting-edge technologies. The book is intended for college students, professionals, and practitioners in natural language processing and related fields, serving as a reference for anyone interested in large language models.

The book addresses the need for a clear grounding in concepts that are becoming increasingly important in natural language processing. It does so by exploring each of the four areas in depth rather than surveying every cutting-edge technique.


📅 Published on Jan 16, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2501.09223
• PDF: https://arxiv.org/pdf/2501.09223

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LargeLanguageModels #NaturalLanguageProcessing #PreTrainingMethods #GenerativeModels #LanguageModelAlignment
🔥 Fara-7B: An Efficient Agentic Model for Computer Use

💡 The paper introduces FaraGen, a synthetic data generation system for computer use agents, which addresses the lack of large and high-quality datasets for training efficient models. The absence of such datasets has limited the progress of computer use agents, unlike large language models that have benefited from abundant textual data. FaraGen generates diverse tasks from frequently used websites, produces multiple solution attempts, and filters successful trajectories using multiple verifiers, achieving high throughput, yield, and diversity for multi-step web tasks at a low cost.
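The generate, attempt, and filter flow can be sketched as follows. The sketch assumes a trajectory is kept only when every verifier accepts it, which the summary does not state explicitly; `build_dataset`, the toy solver, and the two verifiers are hypothetical and stand in for FaraGen's actual components.

```python
from typing import Callable, Dict, List

Trajectory = Dict[str, object]


def build_dataset(
    tasks: List[str],
    solve: Callable[[str, int], Trajectory],
    verifiers: List[Callable[[str, Trajectory], bool]],
    attempts_per_task: int = 3,
) -> List[Trajectory]:
    """Collect several solution attempts per task and keep those every verifier accepts."""
    kept = []
    for task in tasks:
        for attempt in range(attempts_per_task):
            trajectory = solve(task, attempt)
            if all(verify(task, trajectory) for verify in verifiers):
                kept.append(trajectory)
    return kept


# Toy solver (hypothetical): attempt 0 never finishes, later attempts get longer.
def toy_solve(task: str, attempt: int) -> Trajectory:
    return {
        "task": task,
        "actions": ["click", "type", "click"] * (attempt + 1),
        "done": attempt != 0,
    }


def finished(task: str, trajectory: Trajectory) -> bool:
    return bool(trajectory["done"])


def concise(task: str, trajectory: Trajectory) -> bool:
    return len(trajectory["actions"]) <= 6


tasks = ["book a table", "compare two laptops"]
dataset = build_dataset(tasks, toy_solve, [finished, concise])
yield_rate = len(dataset) / (len(tasks) * 3)
```

In this toy run one of three attempts per task survives filtering, so the yield is one third; the yield a real pipeline achieves depends on the solver and on how strict the verifiers are.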

Using the data generated by FaraGen, the authors train Fara-7B, a native computer use agent model that perceives the computer using only screenshots and executes actions via predicted coordinates. Fara-7B is small enough to run on-device, making it efficient for practical applications. The model is evaluated on several benchmarks, including WebVoyager, Online-Mind2Web, and the newly introduced WebTailBench, which better captures under-represented web tasks.

The results show that Fara-7B outperforms other computer use agent models of comparable size on these benchmarks. Moreover, Fara-7B is competitive with much larger models, demonstrating the benefits of scalable data generation systems in advancing small and efficient agentic models. The authors are making Fara-7B available as open-source, along with the WebTailBench benchmark, to facilitate further research and development in the field of computer use agents. Overall, the paper contributes to the advancement of efficient and high-performing computer use agents by introducing a novel data generation system and a state-of-the-art model that can be used for a wide range of web tasks.


📅 Published on Nov 24, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2511.19663
• PDF: https://arxiv.org/pdf/2511.19663
• Project Page: https://aka.ms/msaif/fara

🤖 Models citing this paper:
https://huggingface.co/microsoft/Fara-7B
https://huggingface.co/AlexKitipov/Fara-7B
https://huggingface.co/XythicK/microsoft_Fara-7B-GGUF

📊 Datasets citing this paper:
https://huggingface.co/datasets/microsoft/WebTailBench
https://huggingface.co/datasets/Archi-001/WebTailBench

🚀 Spaces citing this paper:
https://huggingface.co/spaces/2025-ai-timeline/2025-ai-timeline
https://huggingface.co/spaces/prithivMLmods/CUA-GUI-Operator
https://huggingface.co/spaces/HyperCluster/Fara-BrowserUse

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ComputerUseAgents #SyntheticDataGeneration #AgenticModels #WebTaskAutomation #EfficientModelTraining
🔥 OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

💡 The paper introduces a new dataset and method for improving audio-visual question answering systems. Current systems typically process videos in short clips and generate separate descriptions for audio and visual modalities, which can lead to inconsistent descriptions and a lack of cross-modal reasoning. To address this, the authors propose a two-part approach: entity-anchored video scripting, which transforms videos into structured scripts with summaries, main entity lists, and segment-wise audio-visual descriptions, and clue-guided QA generation, which prompts models to mine cross-segment clues from the script and generate QA pairs based on these clues.

The entity-anchored video scripting mechanism ensures cross-segment referential consistency and reconstructs audio-visual associations, while the clue-guided QA generation mechanism encourages models to generate questions that require long-term temporal connections and deep cross-modal reasoning. The authors use this pipeline to construct a new dataset called OmniVideo-100K, which consists of structured scripts and QA pairs, as well as a human-verified test set called OmniVideo-Test.

The results show that fine-tuning models on OmniVideo-100K yields significant performance gains, with improvements of up to 20.59% on the OmniVideo-Test set. The models also demonstrate strong generalization, with improvements of up to 12.64% on established benchmarks such as Daily-Omni and JointAVBench. Overall, the paper contributes a new dataset and method for improving audio-visual question answering systems, with a focus on cross-modal reasoning and temporal consistency.


📅 Published on Jun 12

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.14702
• PDF: https://arxiv.org/pdf/2606.14702
• Project Page: https://yzlmhzz.github.io/OmniVideo-100K/

📊 Datasets citing this paper:
https://huggingface.co/datasets/MiG-NJU/OmniVideo-100K
https://huggingface.co/datasets/MiG-NJU/OmniVideo-Test

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AudioVisualReasoning #MultimodalLearning #VideoUnderstanding #CrossModalReasoning #AudioVisualQuestionAnswering