AI & ML Papers

The AI community building the future. Hugging Face has 443 repositories available. Follow their code on GitHub.

🔥 LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

💡 The paper presents LMCache, an efficient key-value (KV) cache layer for enterprise-scale large language model inference. KV caches are traditionally kept in GPU memory, which limits their reuse across queries and inference engines. As the total volume of KV cache stored by users grows beyond GPU memory capacity, caches need to move off the GPU.

The authors propose LMCache as a solution, which extracts and stores key-value caches generated by modern large language model engines out of the GPU memory and shares them across engines and queries. LMCache supports cache offloading and prefill-decode disaggregation, allowing for cross-engine and GPU cache transfer. The key contributions of LMCache include highly optimized key-value cache data movement, a modular cache connector component that decouples LMCache from the evolution of inference engines, and a control API for flexible cache orchestration across different layers.
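
The sharing idea above can be illustrated with a toy prefix-keyed store. This is a minimal sketch, not LMCache's actual API: the `OffloadStore` class, the `chunk_keys` helper, and the 4-token chunk size are all hypothetical, and the byte strings stand in for real KV tensors. Each chunk key hashes the whole prefix up to that chunk, so a second engine can reuse whatever leading part of its prompt is already stored.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK = 4  # tokens per cache chunk (hypothetical value for illustration)


def chunk_keys(tokens: List[int]) -> List[str]:
    """Hash each full chunk together with everything before it, so a key
    identifies a whole prefix rather than an isolated chunk."""
    keys, running = [], hashlib.sha256()
    full = len(tokens) - len(tokens) % CHUNK  # ignore a trailing partial chunk
    for i in range(0, full, CHUNK):
        running.update(repr(tokens[i:i + CHUNK]).encode())
        keys.append(running.copy().hexdigest())
    return keys


class OffloadStore:
    """Toy stand-in for storage outside GPU memory (CPU RAM, disk, remote)."""

    def __init__(self) -> None:
        self.blobs: Dict[str, bytes] = {}

    def put(self, tokens: List[int], kv_chunks: List[bytes]) -> None:
        for key, blob in zip(chunk_keys(tokens), kv_chunks):
            self.blobs[key] = blob

    def longest_prefix(self, tokens: List[int]) -> Tuple[int, List[bytes]]:
        """Return how many leading tokens are cached, plus their KV blobs."""
        hit: List[bytes] = []
        for key in chunk_keys(tokens):
            if key not in self.blobs:
                break
            hit.append(self.blobs[key])
        return len(hit) * CHUNK, hit


store = OffloadStore()
store.put([1, 2, 3, 4, 5, 6, 7, 8], [b"kv0", b"kv1"])  # written by one engine
reused, blobs = store.longest_prefix(list(range(1, 13)))  # read by another
print(reused, blobs)
```

In this sketch the second request shares its first eight tokens with the stored prompt, so only the last four would need a fresh prefill.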

The evaluation shows up to a 15x throughput improvement when LMCache is combined with a large language model engine. Adoption in enterprise settings also yields practical insights, such as the benefits of fetching key-value caches from remote storage and the impact of context truncation on prefix cache hit ratio. LMCache is open source.

📅 Published on Oct 8, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2510.09665
• PDF: https://arxiv.org/pdf/2510.09665
• Project Page: https://huggingface.co/collections/dvps/dvps-scientific-watch

🤖 Models citing this paper:
• https://huggingface.co/enfinity7B/apac

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LargeLanguageModels #LLMInference #KVCacheOptimization #EnterpriseScaleAI #GPUAcceleratedInference

🔥 Foundations of Large Language Models

💡 The book Foundations of Large Language Models gives an overview of the fundamental concepts underlying large language models. It is structured into four main chapters, each focusing on a key area: pre-training, generative models, prompting techniques, and alignment methods. The authors aim for a foundational understanding rather than comprehensive coverage of all cutting-edge technologies.

The book is intended for college students, professionals, and practitioners in natural language processing and related fields, and can serve as a reference for anyone interested in large language models.

📅 Published on Jan 16, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2501.09223
• PDF: https://arxiv.org/pdf/2501.09223

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LargeLanguageModels #NaturalLanguageProcessing #PreTrainingMethods #GenerativeModels #LanguageModelAlignment

🔥 Fara-7B: An Efficient Agentic Model for Computer Use

💡 The paper introduces FaraGen, a synthetic data generation system for computer use agents, which addresses the lack of large and high-quality datasets for training efficient models. The absence of such datasets has limited the progress of computer use agents, unlike large language models that have benefited from abundant textual data. FaraGen generates diverse tasks from frequently used websites, produces multiple solution attempts, and filters successful trajectories using multiple verifiers, achieving high throughput, yield, and diversity for multi-step web tasks at a low cost.
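
The generate, attempt, and filter loop described above can be sketched in a few lines. This is only an illustration under stated assumptions: `attempt`, `goal_verifier`, `length_verifier`, and `generate_dataset` are hypothetical names, the rollout is simulated with random numbers instead of a browser, and the paper's real verifiers are not reproduced here.

```python
import random
from typing import Callable, Dict, List

Trajectory = Dict[str, object]
Verifier = Callable[[Trajectory], bool]


def attempt(task: str, rng: random.Random) -> Trajectory:
    """Hypothetical rollout; a real one would drive a browser session."""
    return {
        "task": task,
        "steps": rng.randint(1, 8),
        "reached_goal": rng.random() < 0.5,
    }


def goal_verifier(traj: Trajectory) -> bool:
    return bool(traj["reached_goal"])


def length_verifier(traj: Trajectory) -> bool:
    return int(traj["steps"]) <= 6  # arbitrary cutoff for the sketch


def generate_dataset(tasks: List[str], attempts_per_task: int,
                     verifiers: List[Verifier], seed: int = 0) -> List[Trajectory]:
    """Try each task several times; keep a trajectory only if every verifier accepts it."""
    rng = random.Random(seed)
    kept: List[Trajectory] = []
    for task in tasks:
        for _ in range(attempts_per_task):
            traj = attempt(task, rng)
            if all(check(traj) for check in verifiers):
                kept.append(traj)
    return kept


data = generate_dataset(["book a table", "compare two laptops"], 4,
                        [goal_verifier, length_verifier])
print(len(data), "of 8 attempts kept")
```

Requiring all verifiers to agree trades yield for precision; a pipeline could just as well use a majority vote, which is a design choice this sketch does not take from the paper.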

Using the data generated by FaraGen, the authors train Fara-7B, a native computer use agent model that perceives the computer using only screenshots and executes actions via predicted coordinates. Fara-7B is small enough to run on-device, making it efficient for practical applications. The model is evaluated on several benchmarks, including WebVoyager, Online-Mind2Web, and the newly introduced WebTailBench, which better captures under-represented web tasks.

The results show that Fara-7B outperforms other computer use agent models of comparable size on these benchmarks and is competitive with much larger models, demonstrating the benefits of scalable data generation systems in advancing small and efficient agentic models. The authors release Fara-7B as open source, along with the WebTailBench benchmark, to facilitate further research on computer use agents.

🔥 OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

💡 The paper introduces a new dataset and method for improving audio-visual question answering systems. Current systems typically process videos in short clips and generate separate descriptions for audio and visual modalities, which can lead to inconsistent descriptions and a lack of cross-modal reasoning. To address this, the authors propose a two-part approach: entity-anchored video scripting, which transforms videos into structured scripts with summaries, main entity lists, and segment-wise audio-visual descriptions, and clue-guided QA generation, which prompts models to mine cross-segment clues from the script and generate QA pairs based on these clues.

The entity-anchored video scripting mechanism ensures cross-segment referential consistency and reconstructs audio-visual associations, while the clue-guided QA generation mechanism encourages models to generate questions that require long-term temporal connections and deep cross-modal reasoning. The authors use this pipeline to construct a new dataset called OmniVideo-100K, which consists of structured scripts and QA pairs, as well as a human-verified test set called OmniVideo-Test.
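
One way to picture an entity-anchored script is as a small data structure in which every segment refers to entities by a shared ID. The sketch below is an assumption about shape only: the `Segment` and `VideoScript` classes, their field names, and the `cross_segment_clues` helper are hypothetical and not taken from the dataset's schema. The helper finds entities that recur across segments, which is the kind of clue a QA generator could build a cross-segment question on.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Segment:
    start_s: float
    end_s: float
    visual: str
    audio: str
    entities: List[str]  # IDs from the script's main entity list


@dataclass
class VideoScript:
    summary: str
    main_entities: Dict[str, str]  # entity ID -> description
    segments: List[Segment] = field(default_factory=list)


def cross_segment_clues(script: VideoScript) -> Dict[str, List[int]]:
    """Map each entity to the segments that mention it, keeping only
    entities that appear in at least two segments."""
    seen: Dict[str, List[int]] = {}
    for idx, seg in enumerate(script.segments):
        for ent in seg.entities:
            seen.setdefault(ent, []).append(idx)
    return {ent: idxs for ent, idxs in seen.items() if len(idxs) >= 2}


script = VideoScript(
    summary="A chef prepares soup while a radio plays.",
    main_entities={"E1": "chef in a red apron", "E2": "kitchen radio"},
    segments=[
        Segment(0, 10, "E1 chops onions", "E2 plays jazz", ["E1", "E2"]),
        Segment(10, 20, "empty counter", "E2 announces the news", ["E2"]),
        Segment(20, 30, "E1 tastes the soup", "silence", ["E1"]),
    ],
)
clues = cross_segment_clues(script)
print(clues)
```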

The results show that fine-tuning models on OmniVideo-100K yields significant performance gains, with improvements of up to 20.59% on the OmniVideo-Test set. The models also demonstrate strong generalization, with improvements of up to 12.64% on established benchmarks such as Daily-Omni and JointAVBench. Overall, the paper contributes a new dataset and method for improving audio-visual question answering systems, with a focus on cross-modal reasoning and temporal consistency.

📅 Published on Jun 12

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.14702
• PDF: https://arxiv.org/pdf/2606.14702
• Project Page: https://yzlmhzz.github.io/OmniVideo-100K/

📊 Datasets citing this paper:
• https://huggingface.co/datasets/MiG-NJU/OmniVideo-100K
• https://huggingface.co/datasets/MiG-NJU/OmniVideo-Test

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AudioVisualReasoning #MultimodalLearning #VideoUnderstanding #CrossModalReasoning #AudioVisualQuestionAnswering

🔥 Orchestra-o1: Omnimodal Agent Orchestration

💡 The paper presents Orchestra-o1, an omnimodal agent orchestration framework that enables efficient collaboration across modalities such as text, image, audio, and video. Existing agent orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to complex settings where heterogeneous modalities coexist and interact.

To address this limitation, Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution, so that agent systems can tackle complex real-world tasks involving heterogeneous information sources. The framework is trained using decision-aligned group relative policy optimization, an efficient agentic reinforcement learning approach.

The results show that Orchestra-o1 surpasses the second-best approach by 10.3 percent accuracy on the OmniGAIA benchmark. Additionally, the trained Orchestra-o1-8B model achieves state-of-the-art performance against all existing open-source omnimodal agents.
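
For readers unfamiliar with group relative policy optimization, the core step is to score each sampled rollout against the other rollouts for the same prompt. The sketch below shows that generic normalization only; the summary does not describe what the "decision-aligned" variant changes, so nothing here should be read as the paper's method. The function name, the use of the population standard deviation, and the `eps` constant are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward by the mean and spread of its group.
    eps keeps the division finite when every rollout gets the same reward."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of the same task: two succeeded, two failed.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Successful rollouts get a positive advantage and failed ones a negative advantage, without needing a separate learned value model.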

📅 Published on Jun 10

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.13707
• PDF: https://arxiv.org/pdf/2606.13707

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#OmnimodalAgentOrchestration #MultimodalLearning #AgentCollaborationFrameworks #ModalityAwareTaskDecomposition #HeterogeneousModalitiesIntegration

🔥 Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

💡 The paper proposes a new framework called MRAgent that improves the ability of large language model agents to reason over long interaction histories. Current memory-augmented agents struggle with this task because they rely on a static retrieve-then-reason approach, which prevents them from dynamically adapting memory access to new evidence discovered during inference.

To address this issue, MRAgent combines an associative memory graph with an active reconstruction mechanism. The memory graph represents information as a network of cues, tags, and contents, where tags serve as semantic bridges between cues and contents. The active reconstruction mechanism integrates language model reasoning directly into memory access, allowing the agent to iteratively explore and refine retrieval paths based on accumulated evidence. This approach enables the agent to dynamically adapt memory retrieval to the reasoning context, avoiding the need to consider all possible retrieval paths and reducing computational costs.

The authors evaluate MRAgent on two benchmarks, LoCoMo and LongMemEval, and demonstrate significant improvements over strong baselines, with up to 23% better performance, while also reducing token and runtime costs. Overall, the paper contributes a new framework for long-horizon memory reasoning that is more efficient and effective than existing approaches.
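
The difference between one-shot retrieval and iterative reconstruction can be shown on a toy cue, tag, and content graph. Everything in this sketch is hypothetical: the two dictionaries, the `pick_tag` heuristic (word overlap standing in for the language model's choice of which path to follow), and the `reconstruct` loop are illustrative and not MRAgent's implementation.

```python
import re
from typing import Dict, List, Set

# Toy memory graph: cues point to tags, tags point to stored contents.
CUE_TAGS: Dict[str, Set[str]] = {"alice": {"family"}, "bob": {"work"}}
TAG_CONTENTS: Dict[str, List[str]] = {
    "family": ["Alice is the sister of Bob."],
    "work": ["Bob works at a bakery."],
}


def words(text: str) -> Set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))


def pick_tag(question: str, frontier: Set[str]) -> str:
    """Stand-in for the model's reasoning step: prefer the tag whose
    contents share the most words with the question."""
    def overlap(tag: str) -> int:
        return len(words(question) & words(" ".join(TAG_CONTENTS[tag])))
    return max(sorted(frontier), key=overlap)


def reconstruct(question: str, max_steps: int = 3) -> List[str]:
    evidence: List[str] = []
    visited: Set[str] = set()
    cues = words(question) & set(CUE_TAGS)
    for _ in range(max_steps):
        frontier = set().union(*(CUE_TAGS[c] for c in cues)) - visited
        if not frontier:
            break
        tag = pick_tag(question, frontier)
        visited.add(tag)
        evidence.extend(TAG_CONTENTS[tag])
        # New evidence can surface new cues, which open new retrieval paths.
        cues |= words(" ".join(evidence)) & set(CUE_TAGS)
    return evidence


answer_evidence = reconstruct("Where does the sister of Alice work?")
print(answer_evidence)
```

The question mentions only Alice, so a single lookup would stop at the family fact; the second step is reachable only because the first piece of evidence introduced Bob as a new cue.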

📅 Published on Jun 4

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.06036
• PDF: https://arxiv.org/pdf/2606.06036

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#GraphMemoryModels #LLMAgents #MemoryReconstruction #AssociativeMemoryGraphs #LongTermReasoningMechanisms

🔥 FastContext: Training Efficient Repository Explorer for Coding Agents

💡 The paper introduces FastContext, a dedicated exploration subagent designed to improve the efficiency of repository exploration in large language model coding agents. The problem addressed is that repository exploration is a major bottleneck in coding agents, consuming a substantial token budget and polluting the agent's context with irrelevant code snippets.

The method separates repository exploration from code solving by using specialized exploration models. FastContext is invoked on demand and issues parallel tool calls to return concise file paths and line ranges as focused context. The exploration models range from 4B to 30B parameters and are bootstrapped from strong reference-model trajectories. They are then refined with task-grounded rewards for broad first-turn search, multi-turn evidence gathering, and precise citation generation.
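
The "parallel tool calls returning file paths and line ranges" pattern can be sketched with a thread pool over an in-memory repository. This is a toy under stated assumptions: the `REPO` dictionary, the `grep` tool, and the `explore` function are hypothetical, and a plain substring search stands in for the trained exploration model.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical two-file repository held in memory.
REPO: Dict[str, str] = {
    "app/auth.py": "import jwt\n\ndef login(user):\n    return jwt.encode(user)\n",
    "app/views.py": "from app.auth import login\n\ndef index():\n    return 'ok'\n",
}

Citation = Tuple[str, int, int]  # (path, first_line, last_line), 1-indexed


def grep(path: str, needle: str) -> List[Citation]:
    """One 'tool call': report the span of matching lines in a single file."""
    hits = [i for i, line in enumerate(REPO[path].splitlines(), 1) if needle in line]
    return [(path, min(hits), max(hits))] if hits else []


def explore(needle: str) -> List[Citation]:
    """Fan the search out over all files at once and return only citations,
    not file contents, so the caller's context stays small."""
    with ThreadPoolExecutor() as pool:
        per_file = list(pool.map(lambda p: grep(p, needle), sorted(REPO)))
    return [citation for hits in per_file for citation in hits]


citations = explore("login")
print(citations)
```

Returning `(path, first_line, last_line)` tuples rather than code text is what keeps the main agent's context free of irrelevant snippets in this sketch.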

The results show that integrating FastContext into a coding agent improves end-to-end resolution rates by up to 5.5 percent while reducing coding-agent token consumption by up to 60 percent, with minimal overhead. The paper demonstrates that repository exploration can be effectively handled by specialized models, separate from the code solving process. The code and data for FastContext are made available, allowing for further research and development in this area. Overall, the paper presents a significant contribution to the field of coding agents and software engineering, providing a more efficient and effective approach to repository exploration.

📅 Published on Jun 12

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.14066
• PDF: https://arxiv.org/pdf/2606.14066
• Project Page: https://huggingface.co/microsoft/FastContext-1.0-4B-SFT

🤖 Models citing this paper:
• https://huggingface.co/microsoft/FastContext-1.0-4B-SFT
• https://huggingface.co/microsoft/FastContext-1.0-4B-RL

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#EfficientRepositoryExploration #CodingAgents #LargeLanguageModels #RepositoryExplorationSubagents #SpecializedExplorationModels

🔥 JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

💡 The paper introduces a new paradigm for vision-language models, shifting from turn-based systems that require user prompting to a model that operates in real-time, making autonomous decisions about when to respond or delegate. The problem with current large models is that they only answer when addressed and do not interact in real-time, even in video-call apps. To address this, the authors propose a model that continuously watches what is happening and decides on its own whether to speak or stay silent.

The authors make two main contributions. First, they release JoyAI-VL-Interaction, an 8B-scale vision-first vision-language interaction model that makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model. The model excels at vision-triggered responsiveness and time awareness. They also provide a transferable training recipe under which capabilities emerge that were not explicitly trained for, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck.
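
The per-second choice among staying silent, responding, and delegating amounts to a three-way decision loop over a stream of observations. The sketch below shows only that control flow. The `Decision` enum, the `decide` rule, and the observation keys are hypothetical; in the paper the decision is made inside the model, not by hand-written rules like these.

```python
from enum import Enum
from typing import Dict, Iterable, List


class Decision(Enum):
    SILENT = "silent"
    RESPOND = "respond"
    DELEGATE = "delegate"


def decide(observation: Dict[str, bool]) -> Decision:
    """Rule-based stand-in for the model's internal decision on one tick."""
    if observation.get("needs_background_work", False):
        return Decision.DELEGATE  # hand off to a slower background model
    if observation.get("salient_event", False):
        return Decision.RESPOND
    return Decision.SILENT


def run(stream: Iterable[Dict[str, bool]]) -> List[Decision]:
    """One decision per tick (one tick per second of video in this sketch)."""
    return [decide(obs) for obs in stream]


decisions = run([{}, {"salient_event": True}, {"needs_background_work": True}, {}])
print([d.value for d in decisions])
```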

Second, they release a complete deployable system built around the model, which streams any ongoing video into the model, making it genuinely present in the world. The system has pluggable components, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent.

The results show that human raters prefer JoyAI-VL-Interaction over in-app video-call assistants by a wide margin across six real-world scenarios. This is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system, making it a significant contribution to the field of interaction models.

📅 Published on Jun 10

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.14777
• PDF: https://arxiv.org/pdf/2606.14777
• Project Page: https://joyai-vl-video-future-academy-jd.github.io/JoyAI-VL-Interaction/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #RealTimeInteraction #AutonomousDecisionMaking #VisionFirstApproach #MultimodalIntelligence

🔥 DreamX-World 1.0: A General-Purpose Interactive World Model

💡 DreamX-World 1.0 is a general-purpose interactive text-to-video model that generates long-horizon content with camera control and scene persistence. The problem addressed by this model is the need for a controllable and interactive world model that can generate high-quality video content. To solve this problem, the authors introduced several new methods, including a lightweight variant of projective positional encoding called E-PRoPE, which retains projective camera geometry while applying camera-aware attention to spatially reduced tokens.

The authors also converted a bidirectional video generator into a few-step autoregressive world model using causal forcing, DMD-style distillation, and long-rollout training. This training process exposes the model to its own generated history, reducing style and color drift that accumulates across autoregressive chunks. Additionally, the authors introduced Memory-Conditioned Scene Persistence, which retrieves earlier views through camera-geometry-based retrieval, and residual recycling, which makes the conditioning path less sensitive to imperfect memory latents.
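
Retrieval by camera geometry can be pictured as a nearest-pose lookup: earlier views whose camera stood in a similar place and looked in a similar direction are the most likely to show the same scene content. The scoring rule below (position distance plus a weighted viewing-angle difference) is an assumption made for illustration, as are the names `pose_score` and `retrieve`; the summary does not give the paper's actual criterion.

```python
import math
from typing import List, Tuple

Vec = Tuple[float, float, float]
Pose = Tuple[Vec, Vec]  # (camera position, unit viewing direction)


def pose_score(pos_a: Vec, dir_a: Vec, pos_b: Vec, dir_b: Vec,
               angle_weight: float = 1.0) -> float:
    """Lower is more similar: Euclidean distance between camera positions
    plus the angle (radians) between viewing directions."""
    distance = math.dist(pos_a, pos_b)
    cosine = sum(x * y for x, y in zip(dir_a, dir_b))
    angle = math.acos(max(-1.0, min(1.0, cosine)))  # clamp against rounding
    return distance + angle_weight * angle


def retrieve(query: Pose, memory: List[Pose], k: int = 1) -> List[int]:
    """Indices of the k stored views whose camera pose best matches the query."""
    order = sorted(range(len(memory)), key=lambda i: pose_score(*query, *memory[i]))
    return order[:k]


memory: List[Pose] = [
    ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # same spot, same direction
    ((5.0, 0.0, 0.0), (0.0, 0.0, 1.0)),   # far away, same direction
    ((0.0, 0.0, 0.0), (0.0, 0.0, -1.0)),  # same spot, looking backwards
]
query: Pose = ((0.1, 0.0, 0.0), (0.0, 0.0, 1.0))
nearest = retrieve(query, memory, k=2)
print(nearest)
```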

The model also includes Event Instruction Tuning, which adds composable event control, and reinforcement learning alignment, which recovers camera control and visual quality after distillation. To improve efficiency, the authors used mixed-precision DiT execution, residual reuse, 75%-pruned VAE decoding, and asynchronous pipeline parallelism, allowing the model to reach up to 16 FPS on eight RTX 5090 GPUs.

The results show that DreamX-World 1.0 achieves a camera-control score of 73.75 and an overall score of 84.76, outperforming HY-WorldPlay 1.5 and LingBot-World in overall score. The model's ability to generate high-quality video content with camera control and scene persistence makes it a significant contribution to the field of interactive world models. Overall, DreamX-World 1.0 is a powerful tool for generating interactive and controllable video content, with potential applications in a variety of fields, including gaming, simulation, and education.

📅 Published on Jun 15

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.16993
• PDF: https://arxiv.org/pdf/2606.16993
• Project Page: https://huggingface.co/papers?q=projective%20positional%20encoding

🤖 Models citing this paper:
• https://huggingface.co/GD-ML/DreamX-World-5B

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#TextToVideoSynthesis #InteractiveWorldModels #VideoContentGeneration #ScenePersistence #CameraControlMechanisms

🔥 Geometric Action Model for Robot Policy Learning

💡 The paper proposes a Geometric Action Model for robot policy learning that leverages pretrained geometric foundation models to enable language-conditioned manipulation policies in 3D physical environments. The problem addressed is that current vision-language-action models and video world-action models operate primarily on 2D image frames or 2D-derived latent spaces, leaving implicit the 3D geometry required for contact-rich manipulation.

The proposed method, Geometric Action Model, repurposes a pretrained geometric foundation model as a shared substrate for perception, temporal prediction, and action decoding. It splits the model at an intermediate layer, using the shallow layers as an observation encoder and inserting a causal future predictor to forecast future latent tokens conditioned on language, proprioception, and action history. The predicted future tokens are then routed through the remaining model blocks for feature propagation and decoding, allowing a single backbone to produce both future geometry and actions.
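
The split-backbone idea above reduces to three stages: run the shallow blocks, transform the latent with a predictor, then run the remaining blocks. The sketch below shows that data flow with plain Python callables in place of network layers. The four arithmetic "blocks", the split index, and the `future_predictor` function are all hypothetical, and the separate geometry and action heads are omitted.

```python
from typing import Callable, List

Tokens = List[float]
Block = Callable[[Tokens], Tokens]


def run(blocks: List[Block], tokens: Tokens) -> Tokens:
    for block in blocks:
        tokens = block(tokens)
    return tokens


# Stand-in for a pretrained backbone: four simple blocks applied in order.
backbone: List[Block] = [
    lambda t: [v + 1 for v in t],
    lambda t: [v * 2 for v in t],
    lambda t: [v - 3 for v in t],
    lambda t: [v / 2 for v in t],
]

SPLIT = 2  # hypothetical intermediate layer at which the backbone is cut
encoder, decoder = backbone[:SPLIT], backbone[SPLIT:]


def future_predictor(latent: Tokens, condition: float) -> Tokens:
    """Stand-in for the causal predictor; `condition` abstracts language,
    proprioception, and action history into one number."""
    return [v + condition for v in latent]


def step(observation: Tokens, condition: float) -> Tokens:
    latent = run(encoder, observation)             # shallow layers encode the observation
    future = future_predictor(latent, condition)   # forecast future latent tokens
    return run(decoder, future)                    # remaining layers decode the forecast


output = step([1.0, 2.0], condition=3.0)
print(output)
```

Because the predictor sits between two halves of the same backbone, the deep blocks are reused unchanged, which is the sense in which the modification is minimal in this sketch.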

The results show that the Geometric Action Model is more accurate, more robust, faster, and lighter than current foundation-model-scale baselines across a broad suite of simulation and real-robot manipulation benchmarks. This design equips the geometric foundation model with language-conditioned temporal world modeling through minimal architectural modification while preserving its rich geometric priors, making it a significant contribution to robot policy learning in 3D physical environments.

📅 Published on Jun 15

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.17046
• PDF: https://arxiv.org/pdf/2606.17046
• Project Page: https://cvlab-kaist.github.io/Geometric-Action-Model/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#GeometricDeepLearning #RobotPolicyLearning #LanguageConditionedManipulation #3DPhysicalEnvironmentModeling #GeometricFoundationModels
