AI & ML Papers

✨Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

📝 Summary:
Researchers introduce Instruction-Guided Lesion Segmentation (ILS) for chest X-rays (CXRs), which segments diverse lesion types from simple instructions. They developed MIMIC-ILS, a large-scale dataset, and ROSALIA, a vision-language model. ROSALIA accurately segments various lesions and provides textual explan...

🔹 Publication Date: Published on Nov 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15186
• PDF: https://arxiv.org/pdf/2511.15186

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#MedicalAI #LesionSegmentation #ChestXray #VisionLanguageModel #DeepLearning

✨ Explore Data Science 📝 Write your paper

✨Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

📝 Summary:
Hulu-Med is a transparent medical vision-language model unifying diverse data modalities like text, 2D/3D images, and video. It achieves state-of-the-art performance across 30 clinical benchmarks with efficient training, promoting accessible AI.

🔹 Publication Date: Published on Oct 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.08668
• PDF: https://arxiv.org/pdf/2510.08668
• Github: https://github.com/ZJUI-AI4H/Hulu-Med

🔹 Models citing this paper:
• https://huggingface.co/ZJU-AI4H/Hulu-Med-32B
• https://huggingface.co/ZJU-AI4H/Hulu-Med-7B
• https://huggingface.co/ZJU-AI4H/Hulu-Med-14B

==================================

#MedicalAI #VisionLanguageModel #MultimodalAI #HealthcareAI #AIResearch


✨HunyuanOCR Technical Report

📝 Summary:
HunyuanOCR is a lightweight vision-language model for OCR built on a unified end-to-end architecture (ViT + LLM). It achieves state-of-the-art performance across diverse tasks, outperforming larger models and commercial APIs, through data-driven and reinforcement learning (RL) strategies.

🔹 Publication Date: Published on Nov 24

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.19575
• PDF: https://arxiv.org/pdf/2511.19575
• Github: https://github.com/Tencent-Hunyuan/HunyuanOCR

==================================

#OCR #VisionLanguageModel #LLM #AI #MachineLearning


✨Qwen3-VL Technical Report

📝 Summary:
Qwen3-VL is a highly capable vision-language model that achieves superior performance across multimodal benchmarks. It supports interleaved contexts of up to 256K tokens and offers strong text understanding, robust long-context comprehension, and advanced multimodal reasoning through key architectural upgrades.

🔹 Publication Date: Published on Nov 26

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.21631
• PDF: https://arxiv.org/pdf/2511.21631

==================================

#VisionLanguageModel #MultimodalAI #AI #DeepLearning #LLM


✨MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

📝 Summary:
MomaGraph-R1, a vision-language model trained with reinforcement learning, achieves state-of-the-art performance in predicting task-oriented scene graphs and zero-shot task planning in household envir...

🔹 Publication Date: Published on Dec 18

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16909
• PDF: https://arxiv.org/pdf/2512.16909
• Project Page: https://hybridrobotics.github.io/MomaGraph/

==================================

#VisionLanguageModel #EmbodiedAI #ReinforcementLearning #SceneGraphs #Robotics


✨Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

📝 Summary:
This paper introduces IMDD-1M, a large dataset of 1 million industrial defect image-text pairs. It enables training a vision-language foundation model tailored for industrial use. This model achieves comparable performance with less data for specialized tasks, promoting data-efficient quality ins...

🔹 Publication Date: Published on Dec 30, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.24160
• PDF: https://arxiv.org/pdf/2512.24160

==================================

#IndustrialAI #VisionLanguageModel #DefectDetection #MultimodalAI #ComputerVision


✨dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

📝 Summary:
dots.ocr is a unified Vision-Language Model that jointly learns document layout parsing tasks, overcoming limitations of multi-stage pipelines. It achieves state-of-the-art performance on OmniDocBench and sets a new baseline on the challenging multilingual XDocParse benchmark.

🔹 Publication Date: Published on Dec 2, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02498
• Explainer: https://arxivexplained.com/papers/dotsocr-multilingual-document-layout-parsing-in-a-single-vision-language-model
• PDF: https://arxiv.org/pdf/2512.02498
• Github: https://github.com/rednote-hilab/dots.ocr

==================================

#VisionLanguageModel #DocumentParsing #MultilingualAI #AIResearch #DeepLearning


✨LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

📝 Summary:
LightOnOCR-2-1B is a 1B-parameter end-to-end multilingual vision-language model for OCR. It converts document images to text, achieving state-of-the-art results while being smaller and faster. It also features improved image localization and robustness.

🔹 Publication Date: Published on Jan 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.14251
• PDF: https://arxiv.org/pdf/2601.14251

🔹 Models citing this paper:
• https://huggingface.co/lightonai/LightOnOCR-1B-1025
• https://huggingface.co/lightonai/LightOnOCR-2-1B
• https://huggingface.co/lightonai/LightOnOCR-0.9B-32k-1025

🔹 Datasets citing this paper:
• https://huggingface.co/datasets/lightonai/LightOnOCR-mix-0126
• https://huggingface.co/datasets/lightonai/LightOnOCR-bbox-mix-0126

🔹 Spaces citing this paper:
• https://huggingface.co/spaces/lightonai/LightOnOCR-2-1B-Demo
• https://huggingface.co/spaces/lightonai/LightOnOCR-1B-Demo
• https://huggingface.co/spaces/lightonai/LightOnOCR-1B-Demo-zero

==================================

#OCR #VisionLanguageModel #AI #DeepLearning #MultilingualAI


✨Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

📝 Summary:
Qianfan-OCR is a 4B vision-language model that unifies document parsing, layout analysis, and understanding. It features Layout-as-Thought to improve accuracy on complex layouts and achieves state-of-the-art performance across multiple OCR and document intelligence benchmarks.

🔹 Publication Date: Published on Mar 11

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13398
• PDF: https://arxiv.org/pdf/2603.13398
• Project Page: https://github.com/baidubce/Qianfan-VL
• Github: https://github.com/baidubce/Qianfan-VL

🔹 Models citing this paper:
• https://huggingface.co/baidu/Qianfan-OCR

==================================

#OCR #DocumentIntelligence #VisionLanguageModel #AI #MachineLearning


✨EXAONE 4.5 Technical Report

📝 Summary:
EXAONE 4.5 is LG AI Research's first open-weight vision-language model, integrating a visual encoder into EXAONE 4.0. It enhances document understanding and general language capabilities through targeted data and extended context, outperforming similar models in document tasks.

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08644
• PDF: https://arxiv.org/pdf/2604.08644
• Github: https://github.com/LG-AI-EXAONE/EXAONE-4.5

==================================

#VisionLanguageModel #AI #DocumentUnderstanding #MultimodalAI #OpenSourceAI

🔥 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

💡 The paper introduces MinerU2.5, a 1.2-billion-parameter vision-language model for efficient high-resolution document parsing. It uses a two-stage strategy. In the first stage, the model performs layout analysis on downsampled images to identify structural elements, which reduces computational overhead. In the second stage, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables.

To support this strategy, the authors developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning.

The results show that MinerU2.5 achieves state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks while maintaining significantly lower computational overhead.
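
The two-stage strategy described above can be sketched in a few lines of Python. This is a minimal illustration, not MinerU2.5's implementation: the image is a toy nested list, and `detect_layout`, `recognize`, and the downsampling `factor` are hypothetical stand-ins for the model's layout and recognition passes.

```python
from typing import Callable, List, Tuple

Image = List[List[int]]          # toy grayscale image: rows of pixel values
Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def downsample(image: Image, factor: int) -> Image:
    """Keep every `factor`-th pixel in both directions (nearest neighbour)."""
    return [row[::factor] for row in image[::factor]]


def scale_box(box: Box, factor: int) -> Box:
    """Map a box found on the downsampled image back to native coordinates."""
    left, top, right, bottom = box
    return (left * factor, top * factor, right * factor, bottom * factor)


def crop(image: Image, box: Box) -> Image:
    """Cut a region out of the native-resolution image."""
    left, top, right, bottom = box
    return [row[left:right] for row in image[top:bottom]]


def two_stage_parse(
    image: Image,
    detect_layout: Callable[[Image], List[Tuple[str, Box]]],  # hypothetical stage 1
    recognize: Callable[[str, Image], object],                # hypothetical stage 2
    factor: int = 4,                                          # assumed downsampling ratio
) -> List[Tuple[str, Box, object]]:
    # Stage 1: layout analysis runs on the cheap, downsampled page.
    thumbnail = downsample(image, factor)
    results = []
    for label, small_box in detect_layout(thumbnail):
        # Stage 2: recognition runs on a native-resolution crop of each region.
        native_box = scale_box(small_box, factor)
        content = recognize(label, crop(image, native_box))
        results.append((label, native_box, content))
    return results
```

The point of the split is that the expensive recognition step only ever sees small crops, while the full page is processed only at reduced resolution.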

📅 Published on Sep 26, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2509.22186
• PDF: https://arxiv.org/pdf/2509.22186
• Project Page: https://opendatalab.github.io/MinerU/
• GitHub: https://github.com/opendatalab/MinerU ⭐ 61.9k

🤖 Models citing this paper:
• https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
• https://huggingface.co/opendatalab/MinerU-Diffusion-V1-0320-2.5B
• https://huggingface.co/freakynit/MinerU2.5-2509-1.2B

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/xiaoye-winters/MinerU-API
• https://huggingface.co/spaces/opendatalab/MinerU-Diffusion-V1-0320-2.5B
• https://huggingface.co/spaces/Instantnewdesign/document_extract

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DocumentParsing #VisionLanguageModel #HighResolutionImageProcessing #LayoutAnalysis #ContentRecognition

✨ Join Best TG Channels

👋 Join Our WhatsApp Channel

📝 Contact / Collaborate

🔥 SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

💡 The paper introduces SmolDocling, a compact vision-language model designed for end-to-end document conversion. The model aims to process entire pages and generate a new universal markup format called DocTags, which captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models or ensemble solutions, SmolDocling offers a single end-to-end conversion model with 256M parameters. This approach allows for accurately capturing content, structure, and spatial location of document elements.

The model is trained to reproduce document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types, including business documents, academic papers, technical reports, patents, and forms. The authors also contribute novel publicly sourced datasets for charts, tables, equations, and code recognition.

Experimental results demonstrate that SmolDocling performs competitively with other vision-language models that are up to 27 times larger, while substantially reducing computational requirements. The authors plan to make the model and datasets publicly available, which will facilitate further research in this area. Overall, SmolDocling offers an efficient solution for end-to-end document conversion, with potential applications in various industries and domains.

🔥 RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

💡 The paper introduces RoboMemArena, a comprehensive robotic memory benchmark that addresses the limitations of existing benchmarks by providing a large-scale and diverse set of tasks with real-world evaluation. The benchmark consists of 26 tasks with average trajectory lengths of over 1,000 steps per task, and 68.9 percent of subtasks depend on memory. The tasks are generated using a vision-language model that designs and composes subtasks, generates full trajectories, and provides memory-related annotations.

To tackle the challenges of the RoboMemArena benchmark, the authors propose PrediMem, a dual-system vision-language architecture that improves memory management through predictive coding. PrediMem consists of a high-level vision-language model planner that manages a memory bank with recent and keyframe buffers, and uses a predictive coding head to enhance sensitivity to task dynamics.
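
A memory bank with separate recent and keyframe buffers, as described above, might look roughly like this in Python. This is a sketch under stated assumptions, not PrediMem's code: the summary does not say how keyframes are chosen, so promoting a frame when a prediction-error score exceeds a threshold is an assumption made here for illustration, and `MemoryBank`, `prediction_error`, and all buffer sizes are hypothetical.

```python
from collections import deque
from typing import Dict, List


class MemoryBank:
    """Toy memory bank with a recent buffer and a keyframe buffer (illustrative only)."""

    def __init__(self, recent_size: int = 4, keyframe_size: int = 8,
                 error_threshold: float = 0.5) -> None:
        # Both buffers are bounded: the oldest entry is dropped when full.
        self.recent = deque(maxlen=recent_size)
        self.keyframes = deque(maxlen=keyframe_size)
        self.error_threshold = error_threshold

    def add(self, frame: str, prediction_error: float) -> None:
        # Every observation enters the short-term (recent) buffer.
        self.recent.append(frame)
        # Assumption: a large prediction error marks a surprising frame,
        # which is kept longer as a keyframe.
        if prediction_error > self.error_threshold:
            self.keyframes.append(frame)

    def context(self) -> Dict[str, List[str]]:
        # What a planner would read back when choosing the next subtask.
        return {"keyframes": list(self.keyframes), "recent": list(self.recent)}
```

Keeping the two buffers separate lets old but informative frames survive after they have scrolled out of the recent window.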

The authors evaluate PrediMem on RoboMemArena and report that it outperforms all baseline models. The results also offer insights into memory management, model architecture, and scaling laws for complex memory systems.

The paper's key contributions are the RoboMemArena benchmark, a challenging and diverse task set for evaluating robotic memory in partially observable environments, and PrediMem, which improves memory management through predictive coding.

📅 Published on May 11

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.10921
• PDF: https://arxiv.org/pdf/2605.10921
• Project Page: https://robomemarena.github.io/
• GitHub: https://github.com/OpenHelix-Team/RoboMemArena ⭐ 43

━━━━━━━━━━━━━━━━━━━━━━━━
#RoboticMemoryBenchmark #VisionLanguageModel #RoboticsAndMemory #ArtificialIntelligenceBenchmarking #RoboMemArena