AI & ML Papers

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

🔥 OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

💡 The paper proposes a new method called OSCAR for ultra-low-bit key-value cache quantization, which is crucial for efficient deployment of large language models. The problem addressed is that existing quantization methods, such as simple rotations like Hadamard transforms, degrade in accuracy when applied to very low-bit representations, like 2-bit integers. This degradation occurs because these methods do not account for the attention-aware covariance structures that the model actually uses.

To solve this problem, OSCAR estimates the attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. This approach aligns the key-value cache quantization with the covariance structures that the model consumes, leading to higher accuracy and efficiency.
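The idea of calibrating a fixed rotation and clipping thresholds offline can be illustrated with a toy sketch. This is not the paper's algorithm: it assumes 2-D vectors, uses the plain sample covariance in place of the attention-aware covariance, and all function names and the `n_std` clipping rule are hypothetical choices made for illustration.

```python
import math

def covariance_2d(points):
    """Sample covariance entries (cxx, cxy, cyy) of a list of 2-D points."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    cxx = sum((p[0] - mx) ** 2 for p in points) / n
    cyy = sum((p[1] - my) ** 2 for p in points) / n
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    return cxx, cxy, cyy

def rotate(point, theta):
    """Rotate a point by -theta so the axis at angle theta maps onto x."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = point
    return (c * x + s * y, -s * x + c * y)

def calibrate(points, n_std=2.0):
    """Offline step: derive a fixed rotation angle and per-axis clip thresholds.

    Assumption: clip at n_std standard deviations along each rotated axis.
    """
    cxx, cxy, cyy = covariance_2d(points)
    # Principal-axis angle of a 2x2 covariance matrix (diagonalizes it).
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)
    rotated = [rotate(p, theta) for p in points]
    rxx, _, ryy = covariance_2d(rotated)
    return theta, (n_std * math.sqrt(rxx), n_std * math.sqrt(ryy))

def quantize_2bit(value, clip):
    """Clip to [-clip, clip] and snap to 4 uniform levels.

    Returns (code in 0..3, dequantized value).
    """
    clipped = max(-clip, min(clip, value))
    step = 2.0 * clip / 3.0
    code = int(round((clipped + clip) / step))
    return code, code * step - clip
```

At inference time, the stored `theta` and clip thresholds would be reused unchanged for every incoming vector, which is what makes the calibration "offline" in this sketch.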

The authors provide theoretical justification for OSCAR and develop a fully deployable system that is compatible with modern large language model serving frameworks. They evaluate OSCAR on several reasoning models with context lengths of up to 32,000 tokens and report significant accuracy improvements over naive rotation methods. Specifically, OSCAR narrows the accuracy gap to 3.78 and 1.42 points on two models, whereas naive rotation methods collapse to nearly zero accuracy.

The results also show that OSCAR scales to larger models while remaining effectively on par with higher-precision representations. On the systems side, OSCAR reduces key-value cache memory by approximately 8 times, improves throughput by up to 7 times, and accelerates batch-size-1 decoding by up to 3 times over higher-precision representations.
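The memory figure is consistent with simple arithmetic if the higher-precision baseline stores 16 bits per cached value, which is an assumption here since the summary does not name the baseline. The sketch below uses hypothetical model dimensions and ignores quantization metadata such as scales, which may be why the reported saving is approximate.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits):
    """Bytes needed to cache keys and values for one sequence.

    The leading factor of 2 counts both the key and the value tensors.
    Per-group scales or other quantization metadata are not included.
    """
    elements = 2 * layers * kv_heads * head_dim * tokens
    return elements * bits / 8

# Hypothetical model shape, chosen only for illustration.
baseline = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=32000, bits=16)
two_bit = kv_cache_bytes(layers=32, kv_heads=8, head_dim=128, tokens=32000, bits=2)
ratio = baseline / two_bit  # 16 bits / 2 bits = 8.0
```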

📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.17757
• PDF: https://arxiv.org/pdf/2605.17757
• Project Page: https://oscar-quantize.github.io/

🤖 Models citing this paper:
• https://huggingface.co/Zhongzhu/OSCAR-RotationZoo

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#QuantizationMethods #LowBitRepresentations #KeyvalueCache #SpectralCovariance #EfficientDeployment

🔥 VLM3: Vision Language Models Are Native 3D Learners

💡 The paper "VLM3: Vision Language Models Are Native 3D Learners" challenges the common approach to 3D understanding tasks in computer vision. Typically, these tasks rely on specialized vision models with complex designs and extensive data augmentation. However, the authors argue that vision language models can be adapted for 3D understanding through simple architectural modifications and text-based training.

The problem addressed in this paper is that 3D understanding tasks such as depth estimation and object-level 3D understanding are currently dominated by expert vision models that have complex task-specific designs. The authors propose that vision language models can be native 3D learners and achieve comparable performance to these specialized models.

The method used in this study involves making three simple modifications to standard vision language models. These modifications include focal length unification, text-based pixel reference, and data mixture and scaling. The authors propose VLM3, a scalable method that enables standard vision language models to master diverse 3D tasks without requiring complex designs or extensive data augmentation.

The results of the study show that VLM3 advances the depth estimation accuracy of vision language models by a large margin, from 0.84 to 0.9. Additionally, VLM3 enables diverse 3D tasks such as pixel correspondence, camera pose estimation, and object-level 3D understanding, matching the accuracy of expert vision models while maintaining standard architectures and text-based training. Overall, the paper presents a new paradigm for simple and scalable 3D learning, demonstrating that vision language models can be effective native 3D learners.

📅 Published on May 28

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30561
• PDF: https://arxiv.org/pdf/2605.30561

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #3DUnderstanding #DepthEstimation #ObjectLevel3D #ComputerVisionModels

🔥 GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

💡 The paper addresses the problem of real-world image restoration being limited by the lack of high-quality paired training data. Existing synthetic datasets often fail to model real-world degradations, while capturing real-world paired datasets is expensive and difficult. To overcome this, the authors propose using generative multimodal foundation models to produce high-quality targets from real-world low-quality images, referred to as Generative Ground Truth.

The authors systematically evaluate nine state-of-the-art models and find that one model, Nano-Banana-2 with adaptive prompting, is particularly effective at synthesizing realistic and content-faithful high-quality targets. They then use this model to build a dataset, GGT-100K, which consists of over 103,000 low-quality and high-quality paired images covering diverse scenes and real-world degradations.

The results show that using GGT-100K as a training dataset consistently improves the real-world generalization of a wide range of image restoration models, particularly when fine-tuning generative models. The authors conclude that their approach can serve as a practical tool for generating high-quality training data for image restoration tasks, and that GGT-100K is a useful resource for expanding the generalization capabilities of real-world image restoration models.

📅 Published on May 29

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.31039
• PDF: https://arxiv.org/pdf/2605.31039
• Project Page: https://polyu-vclab.github.io/GGT-100K/

📊 Datasets citing this paper:
• https://huggingface.co/datasets/VCLab-PolyU/GGT-100K

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ImageRestoration #GenerativeGroundTruth #RealWorldDegradations #MultimodalFoundationModels #GenerativeMultimodalLearning

🔥 Scaling Agents via Continual Pre-training

💡 The paper addresses the underperformance of large language models on agentic tasks, even though they are capable of autonomous tool use and multi-step reasoning. The authors trace the root cause to the lack of robust agentic foundation models: during post-training, a model must simultaneously learn diverse agentic behaviors and align them to expert demonstrations, which creates optimization tensions.

To overcome this, the authors propose incorporating Agentic Continual Pre-training into the training pipeline to build powerful agentic foundation models, and they develop a deep research agent model called AgentFounder based on this approach.

Evaluated on 10 benchmarks, AgentFounder achieves state-of-the-art performance while retaining strong tool-use ability, with notable results including 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE.

📅 Published on Sep 16, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2509.13310
• PDF: https://arxiv.org/pdf/2509.13310
• Project Page: https://tongyi-agent.github.io/blog/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticFoundationModels #ContinualPretraining #AutonomousToolUse #MultiStepReasoning #AgenticBehaviorLearning

🔥 WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

💡 The paper introduces WebShaper, a framework that synthesizes information-seeking datasets to improve the performance of artificial intelligence agents. The problem addressed is the scarcity of high-quality training data for information-seeking tasks, which are complex and open-ended. Existing approaches typically collect web data and then generate questions, but this can lead to inconsistencies between the information structure and the reasoning structure of the questions and answers.

To solve this problem, WebShaper uses a formalization-driven approach based on set theory and Knowledge Projections. This approach enables precise control over the reasoning structure of the synthesized data. The framework starts by creating seed tasks and then expands them into more complex questions using a multi-step process. The expansion process involves an agentic Expander that uses retrieval and validation tools to ensure the quality of the synthesized data.

The key contribution of WebShaper is its ability to systematically formalize information-seeking tasks and synthesize high-quality datasets. The framework is evaluated on two open-source benchmarks, GAIA and WebWalkerQA, and achieves state-of-the-art performance. The results indicate that datasets synthesized by WebShaper can train information-seeking agents to top performance, offering a way to address data scarcity in complex, open-ended information-seeking tasks.

📅 Published on Jul 20, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• Project Page: https://huggingface.co/papers?q=Knowledge%20Projections%20(KP)
• arXiv: https://arxiv.org/abs/2507.15061
• PDF: https://arxiv.org/pdf/2507.15061

🤖 Models citing this paper:
• https://huggingface.co/Alibaba-NLP/WebShaper-32B

📊 Datasets citing this paper:
• https://huggingface.co/datasets/Alibaba-NLP/WebShaper
• https://huggingface.co/datasets/JingmingChen/PathRefiner

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ArtificialIntelligenceAgents #InformationSeekingTasks #DataSynthesisTechniques #KnowledgeProjections #FormalizationDrivenApproaches

🔥 WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

💡 The paper introduces WebWatcher, a multimodal agent designed to improve visual-language reasoning in deep research tasks. The problem addressed is that most existing research agents are text-centric and overlook visual information, making multimodal deep research challenging. To solve this, WebWatcher is equipped with enhanced visual-language reasoning capabilities, leveraging synthetic multimodal trajectories for efficient training, utilizing various tools for deep reasoning, and enhancing generalization through reinforcement learning.

The method involves using high-quality synthetic multimodal trajectories for cold start training, which allows the agent to learn from both visual and textual information. The agent is also designed to work with various tools to improve its reasoning abilities. Additionally, the paper proposes a new benchmark called BrowseComp-VL, which is used to evaluate the capabilities of multimodal agents in complex information retrieval tasks involving both visual and textual information.

The results show that WebWatcher significantly outperforms existing baseline agents, including proprietary and open-source agents, in four challenging visual question answering benchmarks. This demonstrates the effectiveness of WebWatcher in solving complex multimodal information-seeking tasks and paves the way for further research in this area. Overall, the paper contributes to the development of multimodal agents with stronger reasoning abilities, which can handle both visual and textual information, and provides a new benchmark for evaluating the performance of such agents.

📅 Published on Aug 7, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2508.05748
• PDF: https://arxiv.org/pdf/2508.05748
• Project Page: https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/

🤖 Models citing this paper:
• https://huggingface.co/Alibaba-NLP/WebWatcher-32B
• https://huggingface.co/Alibaba-NLP/WebWatcher-7B

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalLearning #VisionLanguageReasoning #DeepResearchAgents #SyntheticMultimodalTrajectories #ReinforcementLearningForVision

🔥 LongTraceRL: Learning Long-Context Reasoning from Search Agent Trajectories with Rubric Rewards

💡 LongTraceRL addresses long-context reasoning in large language models, which often fail to locate and integrate key information within extensive distracting content. Existing methods based on reinforcement learning with verifiable rewards have shown promise but are limited by low-confusability distractors and by sparse reward signals that cannot supervise intermediate reasoning steps.

To address these issues, the authors introduce LongTraceRL, a method that uses tiered distractor construction and rubric reward design to improve reasoning quality. For data construction, the authors generate multi-hop questions via knowledge graph random walks and leverage search agent trajectories to build tiered distractors. These distractors include documents the agent read but did not cite, which are high in confusability, and documents that appeared in search results but were never opened, which are low in confusability. This approach produces training contexts that are far more challenging than those built by random sampling or one-shot search.
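The two distractor tiers described above can be sketched as set differences over a search trajectory. The `Trajectory` record and its field names are hypothetical, and the actual data pipeline may differ in how it selects and orders documents.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Trajectory:
    """Hypothetical record of one search-agent run (field names are assumptions)."""
    retrieved: List[str]  # document ids that appeared in search results
    opened: List[str]     # document ids the agent actually read
    cited: List[str]      # document ids the agent cited in its answer

def tier_distractors(traj: Trajectory) -> Dict[str, List[str]]:
    """Split non-evidence documents into high- and low-confusability tiers."""
    cited = set(traj.cited)
    opened = set(traj.opened)
    # Read but not cited: plausible enough to open, so harder to rule out.
    high = [d for d in traj.opened if d not in cited]
    # Returned by search but never opened: easier to dismiss.
    low = [d for d in traj.retrieved if d not in opened]
    return {"high": high, "low": low}
```

A training context would then mix the cited evidence with documents drawn from both tiers; the mixing ratio is not stated in the summary.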

The authors also propose a rubric reward that uses gold entities along each reasoning chain as fine-grained, entity-level process supervision. This reward is applied only to responses with correct final answers, which distinguishes the reasoning quality among correct responses and prevents reward hacking.
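A minimal sketch of this gating logic follows. The summary does not give the reward formula, so the substring-based entity matching, the exact-match answer check, and the additive `rubric_weight` combination are all assumptions made for illustration.

```python
from typing import Sequence

def rubric_coverage(response: str, gold_entities: Sequence[str]) -> float:
    """Fraction of gold entities mentioned in the response (case-insensitive)."""
    if not gold_entities:
        return 0.0
    text = response.lower()
    hits = sum(1 for entity in gold_entities if entity.lower() in text)
    return hits / len(gold_entities)

def reward(response: str, final_answer: str, gold_answer: str,
           gold_entities: Sequence[str], rubric_weight: float = 0.5) -> float:
    """Outcome reward plus a rubric bonus that is gated on correctness.

    Incorrect answers get 0.0 regardless of entity coverage, so mentioning
    gold entities cannot compensate for a wrong final answer.
    """
    correct = final_answer.strip().lower() == gold_answer.strip().lower()
    if not correct:
        return 0.0
    return 1.0 + rubric_weight * rubric_coverage(response, gold_entities)
```

Because the bonus is only added on top of a correct answer, two correct responses can still be ranked by how much of the reasoning chain they ground in evidence.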

Experiments on three reasoning large language models across five long-context benchmarks show that LongTraceRL consistently outperforms strong baselines and encourages comprehensive, evidence-grounded reasoning. The code, datasets, and models are available for further research and development.

📅 Published on May 29

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.31584
• PDF: https://arxiv.org/pdf/2605.31584

🤖 Models citing this paper:
• https://huggingface.co/THU-KEG/LongTraceRL-4B
• https://huggingface.co/THU-KEG/LongTraceRL-8B
• https://huggingface.co/THU-KEG/LongTraceRL-30B

📊 Datasets citing this paper:
• https://huggingface.co/datasets/THU-KEG/LongTraceRL

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LongContextReasoning #ReinforcementLearning #LargeLanguageModels #RubricRewards #SearchAgentTrajectories

🔥 COLLEAGUE.SKILL: Automated AI Skill Generation via Expert Knowledge Distillation

💡 The paper introduces COLLEAGUE.SKILL, a system for automatically generating person-grounded AI skills from heterogeneous traces of expert knowledge. The problem addressed is that current methods for creating AI agents that mimic human expertise and judgment are limited: they rely on fragmented evidence and lack a comprehensive workflow for distilling this knowledge into usable skills.

The method presented involves an automated trace-to-skill distillation process that takes materials from a target person or role and produces a versioned skill package with two tracks: a capability track for practices, mental models, and decision heuristics, and a bounded behavior track for communication style, interaction rules, and correction history. This package can be inspected, updated through natural-language feedback, and deployed across agent hosts.
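The two-track package can be pictured as a small immutable data structure. The sketch below is not the project's actual schema: the class names, field names, and the rule that each piece of feedback bumps the version are hypothetical, derived only from the description above.

```python
from dataclasses import dataclass, replace
from typing import Tuple

@dataclass(frozen=True)
class CapabilityTrack:
    practices: Tuple[str, ...] = ()
    mental_models: Tuple[str, ...] = ()
    decision_heuristics: Tuple[str, ...] = ()

@dataclass(frozen=True)
class BehaviorTrack:
    communication_style: str = ""
    interaction_rules: Tuple[str, ...] = ()
    correction_history: Tuple[str, ...] = ()

@dataclass(frozen=True)
class SkillPackage:
    subject: str   # target person or role
    version: int
    capability: CapabilityTrack
    behavior: BehaviorTrack

def apply_feedback(pkg: SkillPackage, feedback: str) -> SkillPackage:
    """Record natural-language feedback and return a new package version.

    Assumption: feedback is appended to the correction history and the
    original package is left untouched, so older versions stay inspectable.
    """
    behavior = replace(
        pkg.behavior,
        correction_history=pkg.behavior.correction_history + (feedback,),
    )
    return replace(pkg, version=pkg.version + 1, behavior=behavior)
```

Keeping packages immutable is one way to make every correction produce a distinct, comparable version rather than silently mutating the skill.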

The open-source repository has approximately 18.5k GitHub stars, 215 skills from 165 contributors, and over 100k cumulative stars across listed skill cards. The system shows how person-grounded skills can be represented as portable, correctable packages rather than opaque prompts or hidden memories. The paper presents the artifact contract, generation workflow, correction lifecycle, deployment surface, and domain presets implemented in the system.

📅 Published on May 29

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.31264
• PDF: https://arxiv.org/pdf/2605.31264

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ArtificialIntelligence #ExpertKnowledgeDistillation #AI_skill_generation #HumanCenteredAI #KnowledgeGraphEmbedding
