AI & ML Papers

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

🔥 dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

💡 The paper introduces dots.ocr, a unified Vision-Language Model that achieves state-of-the-art performance on document layout parsing by jointly learning layout detection, text recognition, and relational understanding. Current methods rely on fragmented, multi-stage pipelines that suffer from error propagation and cannot exploit the synergies of joint training. dots.ocr instead learns all three tasks within a single end-to-end framework, made possible by a highly scalable data engine that synthesizes a large multilingual corpus. This lets the model perform robustly across a wide range of tasks, languages, layouts, and domains.

The model is validated on OmniDocBench and on XDocParse, a new and challenging benchmark introduced in the paper that spans 126 languages. dots.ocr outperforms the next-best competitor by 7.4 points, establishing a strong new baseline and demonstrating its multilingual capabilities.

The paper's contributions are the unified model itself, the new benchmark for multilingual document intelligence, and a demonstration of the advantages of jointly learning layout detection, text recognition, and relational understanding in a single model.

📅 Published on Dec 2, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2512.02498
• PDF: https://arxiv.org/pdf/2512.02498

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DocumentLayoutParsing #VisionLanguageModels #MultilingualOCR #RelationalUnderstanding #EndToEndLearning

🔥 Auditing Agent Harness Safety

💡 The paper "Auditing Agent Harness Safety" addresses how to verify that large language model agents respect safety constraints while running inside execution harnesses. An agent can produce a correct output yet violate safety constraints during execution, and such violations cannot be detected by evaluating only the final output. The authors propose HarnessAudit, a framework that audits the agent's full execution trajectory across three dimensions: boundary compliance, execution fidelity, and system stability. They also introduce HarnessAudit-Bench, a benchmark of 210 tasks across eight real-world domains, to evaluate the safety of agent harnesses.
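The gap between a correct final output and a safe execution can be illustrated with a small sketch. Everything in it is hypothetical: the `Step` record, the `ALLOWED_PREFIXES` policy, and the `audit_trajectory` function are illustrative names rather than HarnessAudit's API, and only a toy version of the boundary-compliance dimension is modeled.

```python
from dataclasses import dataclass

# Hypothetical policy: the agent may only touch files under these prefixes.
ALLOWED_PREFIXES = ("/workspace/",)


@dataclass
class Step:
    tool: str    # e.g. "read_file" or "write_file"
    target: str  # resource the tool touched


def final_output_ok(output: str, expected: str) -> bool:
    """Output-only evaluation: inspects nothing but the final answer."""
    return output.strip() == expected.strip()


def audit_trajectory(steps: list[Step]) -> list[int]:
    """Return indices of steps that access resources outside the allowed boundary."""
    return [
        i for i, step in enumerate(steps)
        if not step.target.startswith(ALLOWED_PREFIXES)
    ]


trajectory = [
    Step("read_file", "/workspace/input.csv"),
    Step("read_file", "/etc/passwd"),            # out-of-bounds access
    Step("write_file", "/workspace/report.txt"),
]

violations = audit_trajectory(trajectory)
print(final_output_ok("42", "42"), violations)
```

Here the output-only check passes while the trajectory audit flags the second step, which is the kind of case the summary describes.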

The authors evaluate ten harness configurations across different models and frameworks and find that task completion does not guarantee safe execution, and safety violations accumulate as the execution trajectory length increases. They also find that safety risks vary across domains, task types, and agent roles, with most violations occurring in resource access and inter-agent information transfer. Additionally, they discover that multi-agent collaboration increases the safety risk surface, while harness design sets the upper bound of safe deployment.

The paper's contributions are the HarnessAudit framework and the HarnessAudit-Bench benchmark, which together support auditing agent harness safety at the trajectory level. The results highlight the importance of trajectory-level auditing and the need for careful harness design to deploy agents safely, particularly in multi-agent systems.

📅 Published on May 14

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.14271
• PDF: https://arxiv.org/pdf/2605.14271
• Project Page: https://harnessaudit.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LanguageModelSafety #AgentHarnessSecurity #ExecutionTrajectoryAudit #SafetyConstraintEvaluation #HarnessComplianceAssessment

🔥 Unlocking Dense Metric Depth Estimation in VLMs

💡 The paper proposes DepthVLM, a framework that enhances Vision-Language Models with dense geometry prediction capabilities. Vision-Language Models are limited in 3D understanding due to their text-only supervision paradigm, which prevents the recovery of dense geometry. Prior methods have limitations such as error accumulation or inefficient prediction. DepthVLM addresses this by attaching a lightweight depth head to the model backbone and training it under a unified vision-text supervision paradigm with a two-stage schedule. This allows the model to generate full-resolution depth maps alongside language outputs in a single forward pass. The authors also introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. The results show that DepthVLM significantly outperforms existing Vision-Language Models, surpasses leading pure vision models, and improves complex 3D spatial reasoning, making it a step toward a truly unified foundation model. The code and checkpoints will be publicly released, making it accessible for further research and development. Overall, DepthVLM provides a simple yet effective solution for dense metric depth estimation in Vision-Language Models, unlocking their potential for 3D understanding and spatial reasoning.

📅 Published on May 15

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15876
• PDF: https://arxiv.org/pdf/2605.15876
• Project Page: https://depthvlm.github.io/

🤖 Models citing this paper:
• https://huggingface.co/JonnyYu828/DepthVLM-4B

📊 Datasets citing this paper:
• https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #DenseMetricDepthEstimation #DepthEstimationInVLMs #GeometryPrediction #VisionTextSupervision

🔥 Lance: Unified Multimodal Modeling by Multi-Task Synergy

💡 The paper introduces Lance, a unified multimodal model that combines understanding, generation, and editing capabilities for images and videos. The goal is to develop a model that can handle multiple tasks without relying on large model capacity or focusing on specific modalities like text or images. Lance achieves this through a dual-stream architecture and collaborative multi-task training, which enables joint context learning while separating the pathways for understanding and generation.

The model uses a mixture-of-experts architecture on shared multimodal sequences, allowing it to learn from both images and videos simultaneously. To address interference among different visual tokens, the model employs modality-aware rotary positional encoding, which helps to align tasks across different modalities.

During training, Lance uses a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling. This approach strengthens both semantic comprehension and visual generation performance. The results show that Lance outperforms existing unified models in image and video generation while maintaining strong multimodal understanding capabilities.

Overall, Lance presents a practical approach to unified multimodal modeling, demonstrating that collaborative multi-task training and a dual-stream architecture can lead to improved performance in multiple tasks without requiring large model capacity. The model has the potential to be applied to various applications that require multimodal understanding, generation, and editing capabilities.

📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18678
• PDF: https://arxiv.org/pdf/2605.18678
• Project Page: https://lance-project.github.io/

🤖 Models citing this paper:
• https://huggingface.co/bytedance-research/Lance

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/Nayefleb/Lance

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalModeling #MultitaskLearning #DualStreamArchitecture #MixtureOfExperts #UnifiedModelingApproach

🔥 LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

💡 The paper introduces LongLive-2.0, a parallel infrastructure for long video generation that addresses training and inference bottlenecks. The problem with existing methods is that they are slow and require a lot of memory, especially for long videos. To solve this, the authors propose a sequence-parallel autoregressive training method called Balanced SP, which pairs clean-history and noisy-target temporal chunks on each rank, enabling efficient teacher-forcing and reducing GPU memory cost.

The method also uses NVFP4 precision to accelerate GEMM computation during training. Additionally, the authors tune a diffusion model into a long, multi-shot, interactive auto-regressive diffusion model, which can be converted to real-time generation with standalone LoRA weights. For inference, the authors enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding.
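As a rough illustration of what 4-bit block-scaled quantization does to values, the sketch below rounds numbers to the 4-bit E2M1 floating-point grid with one scale per 16-element block. It is a simulation under stated assumptions, not the LongLive-2.0 kernels: the scale is kept as an ordinary Python float rather than a low-precision value, no per-tensor scale is applied, and the name `fake_quant_fp4` is hypothetical.

```python
# Non-negative magnitudes representable by a 4-bit E2M1 float.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK = 16  # assumed number of elements sharing one scale


def fake_quant_fp4(values: list[float]) -> list[float]:
    """Quantize then dequantize, so the rounding error can be inspected."""
    out = []
    for start in range(0, len(values), BLOCK):
        block = values[start:start + BLOCK]
        # Map the block's largest magnitude onto 6.0, the top of the grid.
        amax = max(abs(v) for v in block)
        scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
        for v in block:
            # Round the scaled magnitude to the nearest grid point, keep the sign.
            mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
            out.append((mag if v >= 0 else -mag) * scale)
    return out


x = [i / 31 * 2 - 1 for i in range(32)]  # 32 evenly spaced values in [-1, 1]
xq = fake_quant_fp4(x)
max_err = max(abs(a - b) for a, b in zip(x, xq))
print(f"max abs error: {max_err:.4f}")
```

Because the grid is coarsest between 4.0 and 6.0, the worst-case error in a block is one grid unit times the block scale, which is why per-block scaling matters for keeping the error proportional to local magnitudes.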

The results show that LongLive-2.0 achieves up to a 2.15x speedup in training and 1.84x in inference. The LongLive-2.0-5B model reaches 45.7 FPS at inference while attaining strong performance on benchmarks. The authors claim that LongLive-2.0 is the first NVFP4 training and inference system for long video generation. Overall, the paper presents a parallel infrastructure that addresses the speed and memory bottlenecks of long video generation, making real-time generation of high-quality video possible.

📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18739
• PDF: https://arxiv.org/pdf/2605.18739
• Project Page: https://nvlabs.github.io/LongLive/LongLive2/

🤖 Models citing this paper:
• https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B
• https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S4
• https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S2

📊 Datasets citing this paper:
• https://huggingface.co/datasets/Efficient-Large-Model/LongLive2.0-Toy-Dataset

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LongVideoGeneration #ParallelInfrastructure #NVFP4 #AutoregressiveTraining #DiffusionModeling

🔥 AI for Auto-Research: Roadmap & User Guide

💡 The paper "AI for Auto-Research: Roadmap & User Guide" examines the role of artificial intelligence in the research process, highlighting both its potential and its limitations. The authors note that while AI systems can excel at structured tasks such as data analysis and paper writing, they often struggle with novel ideas, scientific judgment, and research-level experiments, so human oversight is required to ensure credible outcomes.

To investigate this further, the authors conducted an end-to-end analysis of AI across the entire research lifecycle, dividing it into four phases: Creation, Writing, Validation, and Dissemination. They found that AI is reliable in tasks that are structured, retrieval-grounded, and tool-mediated, but fragile when it comes to genuinely novel ideas and scientific judgment.

The study reveals that generated ideas often degrade after implementation, research code lags behind pattern-matching benchmarks, and end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. Moreover, the authors show that greater automation can obscure rather than eliminate failure modes, making human-governed collaboration the most credible deployment paradigm.

The paper provides several contributions, including a structured taxonomy, benchmark suite, and tool inventory, as well as cross-stage design principles and a practitioner-oriented playbook. The authors also maintain a project page with resources for further exploration. Overall, the study highlights the importance of human-AI collaboration in research, emphasizing that while AI can be a powerful tool, it is not yet ready to replace human scientists and researchers.

📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18661
• PDF: https://arxiv.org/pdf/2605.18661
• Project Page: https://worldbench.github.io/awesome-ai-auto-research

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ArtificialIntelligenceInResearch #AutoResearchTechnologies #AIForScientificDiscovery #MachineLearningInAcademia #ResearchProcessAutomation

🔥 Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

💡 The paper introduces Adaptive Chunking, a framework that optimizes chunking-method selection for Retrieval-Augmented Generation (RAG) using intrinsic document metrics. The effectiveness of RAG depends on how documents are segmented into smaller units, but one-size-fits-all approaches often fail to capture the nuances of diverse texts. To address this, the framework selects the most suitable chunking strategy for each document based on five novel metrics that assess chunking quality: References Completeness, Intrachunk Cohesion, Document Contextual Coherence, Block Integrity, and Size Compliance. The authors also introduce two new chunkers and targeted post-processing techniques to support the framework.

The results show that the adaptive method significantly improves downstream RAG performance, increasing answer correctness to 72% and the number of successfully answered questions by over 30%, without changing models or prompts. The framework demonstrates that adaptive, document-aware chunking guided by intrinsic metrics offers a practical path to more robust RAG systems. The code for the framework is available, so others can implement and build on the research.
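The idea of choosing a chunker per document from an intrinsic score can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's implementation: the two chunkers are deliberately naive, and `size_compliance` is a simplified stand-in loosely inspired by the Size Compliance metric named above, since the summary does not give the metric definitions. All function names and thresholds are hypothetical.

```python
def paragraph_chunker(text: str) -> list[str]:
    """Split on blank lines, keeping the author's paragraphs intact."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def fixed_chunker(text: str, size: int = 25) -> list[str]:
    """Split into consecutive windows of `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def size_compliance(chunks: list[str], lo: int = 10, hi: int = 30) -> float:
    """Fraction of chunks whose word count falls within [lo, hi]."""
    if not chunks:
        return 0.0
    return sum(lo <= len(c.split()) <= hi for c in chunks) / len(chunks)


# Ties go to the first-listed chunker, so the structure-preserving one comes first.
CHUNKERS = {"paragraph": paragraph_chunker, "fixed": fixed_chunker}


def select_chunker(text: str) -> str:
    """Pick the chunker whose output scores highest for this document."""
    scores = {name: size_compliance(fn(text)) for name, fn in CHUNKERS.items()}
    return max(scores, key=scores.get)


structured = "\n\n".join("word " * 20 for _ in range(5))  # five 20-word paragraphs
unstructured = "word " * 300                               # one 300-word block

print(select_chunker(structured), select_chunker(unstructured))
```

A well-structured document keeps its paragraph boundaries, while a single long block falls back to fixed-size windows. A fuller version would combine several metrics rather than rely on one score.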

📅 Published on Mar 26

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2603.25333
• PDF: https://arxiv.org/pdf/2603.25333

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AdaptiveChunking #RetrievalAugmentedGeneration #ChunkingMethodOptimization #DocumentSegmentationTechniques #RAGModelImprovements

🔥 Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

💡 The paper proposes a novel framework called Code-as-Room for generating 3D indoor rooms from top-down view images. The problem addressed is the difficulty in designing realistic and functional 3D indoor rooms, which is essential for various applications such as interior design, virtual reality, and gaming. Existing methods that use text-based descriptions or reference images struggle to capture precise spatial information and suffer from instability and infinite looping when tasked with holistic room generation.

The proposed method, Code-as-Room, uses a multilayer language model-based agentic framework with a structured execution harness to generate executable Blender code from top-down images. The framework parses the reference image to extract scene elements and their spatial relationships and synthesizes code for geometry, materials, and lighting in a multi-stage pipeline. A cross-stage memory module is used to maintain context and mitigate context forgetting.

The results show that the proposed framework is effective in generating 3D rooms from top-down images. A dedicated benchmark for code-based 3D room synthesis is introduced, which encompasses various evaluation protocols. Comprehensive comparisons against existing agent-based methods are conducted, validating the effectiveness of the proposed execution harness. The paper contributes to the field by providing a principled approach to 3D room synthesis from top-down views, addressing the limitations of existing methods and demonstrating the potential of using executable code as a representation for 3D rooms.

📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18451
• PDF: https://arxiv.org/pdf/2605.18451
• Project Page: https://code-as-room.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#CodeAsRoom #3DRoomGeneration #AgenticCodeSynthesis #IndoorSceneUnderstanding #ArchitectureGeneration

🔥 SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

💡 The paper introduces SkillsVote, a governance framework for managing reusable skills in long-horizon large language model agents. The problem addressed is that raw trajectories of agent experiences are noisy and hard to govern, making it difficult to reuse and improve agent skills. To solve this, the authors propose treating agent skills as an experience schema that combines executable scripts with non-executable guidance on procedures.

The SkillsVote framework consists of three main processes: collection, recommendation, and evolution of agent skills. It starts by profiling a large open-source corpus of skills to identify environment requirements, quality, and verifiability. Then, it synthesizes tasks for verifiable skills and performs a search over a structured skill library to provide instructional context before execution. After execution, it decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, and admits only successful reusable discoveries to updates.

The evaluation of SkillsVote shows promising results, with offline evolution improving performance on Terminal-Bench 2.0 by up to 7.9 percentage points and online evolution improving performance on SWE-Bench Pro by up to 2.6 percentage points. The key contribution of the paper is that governed external skill libraries can improve frozen agents without requiring model updates, as long as systems control exposure, credit, and preservation of skills. Overall, the SkillsVote framework provides a structured approach to managing and improving agent skills, enabling more efficient and effective reuse of experience and knowledge.

📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18401
• PDF: https://arxiv.org/pdf/2605.18401
• Project Page: https://skills.vote

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgentGovernance #LargeLanguageModels #SkillEvolution #ReusableSkills #LifecycleManagement
