AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
🔥 LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference

💡 The paper presents LMCache, an efficient key-value (KV) cache layer for large language model inference at enterprise scale. Inference engines traditionally keep KV caches in GPU memory, which limits cache reuse across queries and across inference engines. As the total KV cache stored by users grows beyond the capacity of GPU memory, caches need to move outside GPU devices.

The authors propose LMCache, which extracts the KV caches generated by modern large language model engines, stores them outside GPU memory, and shares them across engines and queries. LMCache supports both cache offloading and prefill-decode disaggregation, the latter relying on cache transfer across engines and GPUs. Its key contributions are highly optimized KV cache data movement, a modular cache connector component that decouples LMCache from the evolution of inference engines, and a control API for flexible cache orchestration across different layers.

In the evaluation, combining LMCache with a large language model engine improves throughput by up to 15x. Adoption in enterprise settings also yields practical insights, such as the benefit of fetching KV caches from remote storage and the effect of context truncation on the prefix cache hit ratio. Overall, LMCache is presented as an efficient, open-source KV caching solution for large language model inference.
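The link between context truncation and prefix cache hit ratio can be illustrated with a toy model. This is a minimal sketch, not LMCache's actual API: `chunk_keys`, `OffloadedKVStore`, and the tiny `CHUNK_SIZE` are hypothetical names and values, and the stored strings stand in for real KV tensors. It assumes each cached chunk is keyed by a hash of the entire token prefix up to that chunk.

```python
import hashlib
from typing import Dict, List

CHUNK_SIZE = 4  # hypothetical value, kept tiny for illustration


def chunk_keys(tokens: List[int], chunk_size: int = CHUNK_SIZE) -> List[str]:
    """One key per full chunk; each key hashes the whole prefix up to that chunk."""
    keys = []
    running = hashlib.sha256()
    full = len(tokens) - len(tokens) % chunk_size
    for start in range(0, full, chunk_size):
        chunk = tokens[start:start + chunk_size]
        running.update(",".join(map(str, chunk)).encode() + b";")
        keys.append(running.copy().hexdigest())
    return keys


class OffloadedKVStore:
    """Stand-in for storage outside GPU memory (CPU RAM, disk, remote)."""

    def __init__(self) -> None:
        self._store: Dict[str, str] = {}

    def put(self, tokens: List[int]) -> None:
        for i, key in enumerate(chunk_keys(tokens)):
            self._store.setdefault(key, f"kv-chunk-{i}")  # placeholder for KV tensors

    def longest_prefix_hit(self, tokens: List[int]) -> int:
        """Number of leading tokens whose KV cache could be reused."""
        hit = 0
        for key in chunk_keys(tokens):
            if key not in self._store:
                break
            hit += CHUNK_SIZE
        return hit


store = OffloadedKVStore()
history = list(range(12))
store.put(history)

extended = history + [100, 101]        # same prefix, new tokens appended
truncated = history[4:] + [100, 101]   # oldest tokens dropped from the front

hit_extended = store.longest_prefix_hit(extended)
hit_truncated = store.longest_prefix_hit(truncated)
```

Under this keying scheme, appending to a conversation keeps every earlier chunk reusable, while dropping tokens from the front changes every prefix hash and the lookup misses from the first chunk onward.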


📅 Published on Oct 8, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2510.09665
• PDF: https://arxiv.org/pdf/2510.09665
• Project Page: https://huggingface.co/collections/dvps/dvps-scientific-watch

🤖 Models citing this paper:
https://huggingface.co/enfinity7B/apac

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LargeLanguageModels #LLMInference #KVCacheOptimization #EnterpriseScaleAI #GPUAcceleratedInference
🔥 Foundations of Large Language Models

💡 The book Foundations of Large Language Models provides a comprehensive overview of the fundamental concepts underlying large language models. The book is structured into four main chapters, each focusing on a key area: pre-training, generative models, prompting techniques, and alignment methods. The authors aim to provide a foundational understanding of large language models, rather than a comprehensive coverage of all cutting-edge technologies. The book is intended for college students, professionals, and practitioners in natural language processing and related fields, serving as a reference for anyone interested in large language models.

The book addresses the need for a clear understanding of the foundational concepts of large language models, which are becoming increasingly important in natural language processing. It takes a structured approach, dividing the topic into four key areas and exploring each in depth. The result is a solid foundation that students, professionals, and practitioners in the field can use as a reference.

Overall, the book covers pre-training, generative models, prompting techniques, and alignment methods, and favors a solid grounding in the underlying concepts over coverage of every cutting-edge technology.


📅 Published on Jan 16, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2501.09223
• PDF: https://arxiv.org/pdf/2501.09223

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LargeLanguageModels #NaturalLanguageProcessing #PreTrainingMethods #GenerativeModels #LanguageModelAlignment
🔥 Fara-7B: An Efficient Agentic Model for Computer Use

💡 The paper introduces FaraGen, a synthetic data generation system for computer use agents, which addresses the lack of large and high-quality datasets for training efficient models. The absence of such datasets has limited the progress of computer use agents, unlike large language models that have benefited from abundant textual data. FaraGen generates diverse tasks from frequently used websites, produces multiple solution attempts, and filters successful trajectories using multiple verifiers, achieving high throughput, yield, and diversity for multi-step web tasks at a low cost.
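The generate, attempt, and filter loop described above can be sketched as follows. This is an illustrative sketch only, not FaraGen's implementation: `Trajectory`, `collect_verified`, `toy_attempt`, and both verifiers are hypothetical, and it assumes a trajectory is kept only when every verifier accepts it (the summary does not say how the verifiers are combined).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Trajectory:
    task: str
    actions: List[str]
    final_page: str


Verifier = Callable[[Trajectory], bool]


def collect_verified(
    tasks: List[str],
    attempt: Callable[[str, int], Trajectory],
    verifiers: List[Verifier],
    attempts_per_task: int = 3,
) -> List[Trajectory]:
    """Run several attempts per task and keep those that all verifiers accept."""
    kept = []
    for task in tasks:
        for i in range(attempts_per_task):
            trajectory = attempt(task, i)
            if all(verify(trajectory) for verify in verifiers):
                kept.append(trajectory)
    return kept


def toy_attempt(task: str, i: int) -> Trajectory:
    """Toy stand-in for an agent rollout: even-numbered attempts succeed."""
    outcome = f"done: {task}" if i % 2 == 0 else "error"
    return Trajectory(task=task, actions=["click"] * (i + 1), final_page=outcome)


def reached_goal(t: Trajectory) -> bool:
    return t.final_page.startswith("done")


def within_step_budget(t: Trajectory) -> bool:
    return len(t.actions) <= 5


verified = collect_verified(
    ["book a table", "compare two laptops"],
    toy_attempt,
    [reached_goal, within_step_budget],
)
```

Requiring agreement from all verifiers trades yield for precision; a majority vote would be the obvious alternative if more data is needed.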

Using the data generated by FaraGen, the authors train Fara-7B, a native computer use agent model that perceives the computer using only screenshots and executes actions via predicted coordinates. Fara-7B is small enough to run on-device, making it efficient for practical applications. The model is evaluated on several benchmarks, including WebVoyager, Online-Mind2Web, and the newly introduced WebTailBench, which better captures under-represented web tasks.

The results show that Fara-7B outperforms other computer use agent models of comparable size on these benchmarks and is competitive with much larger models, demonstrating the benefits of scalable data generation systems in advancing small and efficient agentic models. The authors are making Fara-7B available as open source, along with the WebTailBench benchmark, to facilitate further research on computer use agents. Overall, the paper contributes a novel data generation system and a state-of-the-art model that can be used for a wide range of web tasks.


📅 Published on Nov 24, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2511.19663
• PDF: https://arxiv.org/pdf/2511.19663
• Project Page: https://aka.ms/msaif/fara

🤖 Models citing this paper:
https://huggingface.co/microsoft/Fara-7B
https://huggingface.co/AlexKitipov/Fara-7B
https://huggingface.co/XythicK/microsoft_Fara-7B-GGUF

📊 Datasets citing this paper:
https://huggingface.co/datasets/microsoft/WebTailBench
https://huggingface.co/datasets/Archi-001/WebTailBench

🚀 Spaces citing this paper:
https://huggingface.co/spaces/2025-ai-timeline/2025-ai-timeline
https://huggingface.co/spaces/prithivMLmods/CUA-GUI-Operator
https://huggingface.co/spaces/HyperCluster/Fara-BrowserUse

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ComputerUseAgents #SyntheticDataGeneration #AgenticModels #WebTaskAutomation #EfficientModelTraining
🔥 OmniVideo-100K: A Dataset for Audio-Visual Reasoning through Structured Scripts and Evidence Chains

💡 The paper introduces a new dataset and method for improving audio-visual question answering systems. Current systems typically process videos in short clips and generate separate descriptions for audio and visual modalities, which can lead to inconsistent descriptions and a lack of cross-modal reasoning. To address this, the authors propose a two-part approach: entity-anchored video scripting, which transforms videos into structured scripts with summaries, main entity lists, and segment-wise audio-visual descriptions, and clue-guided QA generation, which prompts models to mine cross-segment clues from the script and generate QA pairs based on these clues.

The entity-anchored video scripting mechanism ensures cross-segment referential consistency and reconstructs audio-visual associations, while the clue-guided QA generation mechanism encourages models to generate questions that require long-term temporal connections and deep cross-modal reasoning. The authors use this pipeline to construct a new dataset called OmniVideo-100K, which consists of structured scripts and QA pairs, as well as a human-verified test set called OmniVideo-Test.
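One way to picture the structured script is as a small data model. The sketch below is an assumption for illustration and not the dataset's actual schema: `Segment`, `VideoScript`, `cross_segment_clues`, and all field names are hypothetical. It shows how anchoring each segment to a shared entity list makes it easy to find entities that recur across segments, which are natural candidates for cross-segment clues.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Segment:
    start_s: float
    end_s: float
    visual: str
    audio: str
    entities: List[str]  # ids that refer to the script-level entity list


@dataclass
class VideoScript:
    summary: str
    entities: Dict[str, str]  # entity id -> description
    segments: List[Segment]


def cross_segment_clues(script: VideoScript) -> Dict[str, List[int]]:
    """Map each known entity to the segments mentioning it, keeping entities seen in 2+ segments."""
    seen: Dict[str, List[int]] = {}
    for index, segment in enumerate(script.segments):
        for entity_id in dict.fromkeys(segment.entities):  # de-duplicate, keep order
            if entity_id in script.entities:
                seen.setdefault(entity_id, []).append(index)
    return {e: idxs for e, idxs in seen.items() if len(idxs) >= 2}


script = VideoScript(
    summary="A chef prepares a dish while a radio plays in the background.",
    entities={"E1": "chef in a white apron", "E2": "radio on the shelf"},
    segments=[
        Segment(0, 10, "E1 chops onions", "knife on board", ["E1"]),
        Segment(10, 20, "close-up of the shelf", "E2 plays a news jingle", ["E2"]),
        Segment(20, 30, "E1 plates the dish", "E1 hums the jingle", ["E1", "E1"]),
    ],
)

clues = cross_segment_clues(script)
```

A QA generator could then be prompted with the segments listed for each recurring entity, so that answering requires connecting evidence from distant parts of the video.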

The results show that fine-tuning models on OmniVideo-100K yields significant performance gains, with improvements of up to 20.59% on the OmniVideo-Test set. The models also demonstrate strong generalization, with improvements of up to 12.64% on established benchmarks such as Daily-Omni and JointAVBench. Overall, the paper contributes a new dataset and method for improving audio-visual question answering systems, with a focus on cross-modal reasoning and temporal consistency.


📅 Published on Jun 12

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.14702
• PDF: https://arxiv.org/pdf/2606.14702
• Project Page: https://yzlmhzz.github.io/OmniVideo-100K/

📊 Datasets citing this paper:
https://huggingface.co/datasets/MiG-NJU/OmniVideo-100K
https://huggingface.co/datasets/MiG-NJU/OmniVideo-Test

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AudioVisualReasoning #MultimodalLearning #VideoUnderstanding #CrossModalReasoning #AudioVisualQuestionAnswering
🔥 Orchestra-o1: Omnimodal Agent Orchestration

💡 The paper presents Orchestra-o1, an omnimodal agent orchestration framework that enables efficient collaboration across modalities such as text, image, audio, and video. Existing agent orchestration frameworks are limited to a narrow set of modalities and struggle to generalize to complex settings where heterogeneous modalities coexist and interact.

To address this limitation, Orchestra-o1 introduces a unified orchestration mechanism that enables modality-aware task decomposition, online sub-agent specialization, and parallel sub-task execution, so that agent systems can tackle complex real-world tasks involving heterogeneous information sources. The framework is trained using decision-aligned group relative policy optimization, an efficient agentic reinforcement learning approach.

The results show that Orchestra-o1 achieves superior performance on complex multimodal benchmarks, surpassing the second-best approach by 10.3 percent accuracy on the OmniGAIA benchmark. The trained Orchestra-o1-8B model also achieves state-of-the-art performance among existing open-source omnimodal agents.
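The summary does not describe what makes the training objective "decision-aligned", so the sketch below covers only the generic group-relative step that group relative policy optimization is built on: sample several rollouts for the same task, then score each rollout relative to its own group. The function name and the use of the population standard deviation are assumptions made for illustration.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each rollout's reward by the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # assumption: population std; implementations vary
    return [(reward - mean) / (std + eps) for reward in rewards]


# Four rollouts for one task: two succeeded (reward 1), two failed (reward 0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the baseline is the group mean, no separate value network is needed, and a group where every rollout earns the same reward contributes zero advantage.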


📅 Published on Jun 10

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.13707
• PDF: https://arxiv.org/pdf/2606.13707

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#OmnimodalAgentOrchestration #MultimodalLearning #AgentCollaborationFrameworks #ModalityAwareTaskDecomposition #HeterogeneousModalitiesIntegration
🔥 Memory is Reconstructed, Not Retrieved: Graph Memory for LLM Agents

💡 The paper proposes MRAgent, a framework that improves the ability of large language model agents to reason over long interaction histories. Current memory-augmented agents struggle with this task because they rely on a static retrieve-then-reason approach, which prevents them from adapting memory access to new evidence discovered during inference.

To address this issue, MRAgent combines an associative memory graph with an active reconstruction mechanism. The memory graph represents information as a network of cues, tags, and contents, where tags serve as semantic bridges between cues and contents. The active reconstruction mechanism integrates language model reasoning directly into memory access, allowing the agent to iteratively explore and refine retrieval paths based on accumulated evidence. Because retrieval adapts to the reasoning context, the agent does not need to consider all possible retrieval paths, which reduces computational cost.

The authors evaluate MRAgent on two benchmarks, LoCoMo and LongMemEval, and report improvements of up to 23% over strong baselines, along with lower token and runtime costs.
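The cue, tag, and content structure and the iterative exploration loop can be sketched in a few lines. This is a toy illustration, not MRAgent's implementation: `MemoryGraph`, `reconstruct`, and `select_tag` are hypothetical names, `select_tag` stands in for the language model's choice of which path to follow, and treating each word of retrieved content as a potential new cue is a simplifying assumption.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Set


class MemoryGraph:
    """Cues point to tags; tags point to stored contents."""

    def __init__(self) -> None:
        self.cue_to_tags: Dict[str, Set[str]] = defaultdict(set)
        self.tag_to_contents: Dict[str, List[str]] = defaultdict(list)

    def add(self, cues: List[str], tags: List[str], content: str) -> None:
        for cue in cues:
            self.cue_to_tags[cue].update(tags)
        for tag in tags:
            self.tag_to_contents[tag].append(content)


def reconstruct(
    graph: MemoryGraph,
    cues: List[str],
    select_tag: Callable[[List[str], List[str]], Optional[str]],
    max_steps: int = 3,
) -> List[str]:
    """Follow one tag per step; retrieved evidence can open up new tags."""
    evidence: List[str] = []
    visited: Set[str] = set()
    frontier: Set[str] = set()
    for cue in cues:
        frontier |= graph.cue_to_tags.get(cue, set())
    for _ in range(max_steps):
        candidates = sorted(frontier - visited)
        if not candidates:
            break
        tag = select_tag(candidates, evidence)  # stand-in for LLM reasoning
        if tag is None:
            break
        visited.add(tag)
        for content in graph.tag_to_contents.get(tag, []):
            if content not in evidence:
                evidence.append(content)
            for word in content.lower().split():  # assumption: words act as cues
                frontier |= graph.cue_to_tags.get(word, set())
    return evidence


graph = MemoryGraph()
graph.add(cues=["alice"], tags=["trip"], content="alice visited paris")
graph.add(cues=["paris"], tags=["food"], content="the paris cafe served crepes")

recalled = reconstruct(graph, ["alice"], lambda candidates, evidence: candidates[0])
```

A static retrieve-then-reason pipeline would look up only the tags reachable from the original cue; here the second memory becomes reachable only after the first one is read, which is the adaptive behavior the summary describes.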


📅 Published on Jun 4

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.06036
• PDF: https://arxiv.org/pdf/2606.06036

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#GraphMemoryModels #LLMAgents #MemoryReconstruction #AssociativeMemoryGraphs #LongTermReasoningMechanisms