AI & ML Papers

🔥 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

💡 The paper introduces MinerU2.5, a 1.2-billion-parameter vision-language model for efficient high-resolution document parsing. It decouples parsing into two stages. First, the model runs layout analysis on a downsampled image to identify structural elements, which keeps the cost of whole-page analysis low. Second, it runs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained detail in dense text, complex formulas, and tables. To support this strategy, the authors built a data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. The authors report state-of-the-art results on multiple benchmarks, surpassing both general-purpose and domain-specific models across recognition tasks at significantly lower computational overhead.
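The two-stage flow described above can be sketched in a few lines of Python. This is only an illustration of the coordinate bookkeeping, not the authors' implementation: `Region`, `thumbnail_size`, `to_native`, `parse_page`, the `max_side` parameter, and the `detect_layout` / `recognize` callbacks are all hypothetical names, and the two callbacks stand in for the model's layout and recognition passes.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Tuple

Size = Tuple[int, int]            # (width, height) in pixels
Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels


@dataclass
class Region:
    kind: str  # hypothetical label, e.g. "text", "table", "formula"
    box: Box   # coordinates in the downsampled image


def thumbnail_size(native: Size, max_side: int) -> Size:
    """Shrink (width, height) so the longer side is at most max_side."""
    w, h = native
    scale = min(1.0, max_side / max(w, h))
    return max(1, round(w * scale)), max(1, round(h * scale))


def to_native(box: Box, thumb: Size, native: Size) -> Box:
    """Map a box from thumbnail coordinates back to native-resolution pixels."""
    sx = native[0] / thumb[0]
    sy = native[1] / thumb[1]
    left, top, right, bottom = box
    # Round outward so the native crop fully covers the detected region.
    return (
        max(0, math.floor(left * sx)),
        max(0, math.floor(top * sy)),
        min(native[0], math.ceil(right * sx)),
        min(native[1], math.ceil(bottom * sy)),
    )


def parse_page(
    native: Size,
    max_side: int,
    detect_layout: Callable[[Size], List[Region]],
    recognize: Callable[[str, Box], str],
) -> List[Tuple[str, Box, str]]:
    """Stage 1: layout on the thumbnail. Stage 2: recognition per native crop."""
    thumb = thumbnail_size(native, max_side)
    results = []
    for region in detect_layout(thumb):          # stage 1 (placeholder callback)
        crop_box = to_native(region.box, thumb, native)
        text = recognize(region.kind, crop_box)  # stage 2 (placeholder callback)
        results.append((region.kind, crop_box, text))
    return results
```

The point of the sketch is that the expensive recognition step only ever sees small native-resolution crops, while the whole-page pass works on a cheap thumbnail. In a real pipeline the callbacks would receive image data rather than sizes and boxes.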

📅 Published on Sep 26, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2509.22186
• PDF: https://arxiv.org/pdf/2509.22186
• Project Page: https://opendatalab.github.io/MinerU/
• GitHub: https://github.com/opendatalab/MinerU ⭐ 61.9k

🤖 Models citing this paper:
• https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
• https://huggingface.co/opendatalab/MinerU-Diffusion-V1-0320-2.5B
• https://huggingface.co/freakynit/MinerU2.5-2509-1.2B

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/xiaoye-winters/MinerU-API
• https://huggingface.co/spaces/opendatalab/MinerU-Diffusion-V1-0320-2.5B
• https://huggingface.co/spaces/Instantnewdesign/document_extract

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DocumentParsing #VisionLanguageModel #HighResolutionImageProcessing #LayoutAnalysis #ContentRecognition

