AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
Instruction-Guided Lesion Segmentation for Chest X-rays with Automatically Generated Large-Scale Dataset

📝 Summary:
Researchers introduce Instruction-Guided Lesion Segmentation (ILS) for chest X-rays (CXRs), which segments diverse lesion types from simple instructions. They developed MIMIC-ILS, a large-scale dataset, and ROSALIA, a vision-language model. ROSALIA accurately segments various lesions and provides textual explan...

🔹 Publication Date: Published on Nov 19

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.15186
• PDF: https://arxiv.org/pdf/2511.15186

==================================

For more data science resources:
https://xn--r1a.website/DataScienceT

#MedicalAI #LesionSegmentation #ChestXray #VisionLanguageModel #DeepLearning
Hulu-Med: A Transparent Generalist Model towards Holistic Medical Vision-Language Understanding

📝 Summary:
Hulu-Med is a transparent medical vision-language model unifying diverse data modalities like text, 2D/3D images, and video. It achieves state-of-the-art performance across 30 clinical benchmarks with efficient training, promoting accessible AI.

🔹 Publication Date: Published on Oct 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.08668
• PDF: https://arxiv.org/pdf/2510.08668
• Github: https://github.com/ZJUI-AI4H/Hulu-Med

🔹 Models citing this paper:
https://huggingface.co/ZJU-AI4H/Hulu-Med-32B
https://huggingface.co/ZJU-AI4H/Hulu-Med-7B
https://huggingface.co/ZJU-AI4H/Hulu-Med-14B

#MedicalAI #VisionLanguageModel #MultimodalAI #HealthcareAI #AIResearch
HunyuanOCR Technical Report

📝 Summary:
HunyuanOCR is a lightweight vision-language model for OCR built on a unified end-to-end architecture (ViT + LLM). It achieves state-of-the-art performance across diverse tasks, outperforming larger models and commercial APIs, and relies on data-driven and RL strategies.

🔹 Publication Date: Published on Nov 24

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.19575
• PDF: https://arxiv.org/pdf/2511.19575
• Github: https://github.com/Tencent-Hunyuan/HunyuanOCR

#OCR #VisionLanguageModel #LLM #AI #MachineLearning
Qwen3-VL Technical Report

📝 Summary:
Qwen3-VL is a highly capable vision-language model that achieves superior performance across multimodal benchmarks. It supports interleaved contexts of up to 256K tokens and, through key architectural upgrades, offers strong text understanding, robust long-context comprehension, and advanced multimodal reasoning.

🔹 Publication Date: Published on Nov 26

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.21631
• PDF: https://arxiv.org/pdf/2511.21631

#VisionLanguageModel #MultimodalAI #AI #DeepLearning #LLM
MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

📝 Summary:
MomaGraph-R1, a vision-language model trained with reinforcement learning, achieves state-of-the-art performance in predicting task-oriented scene graphs and zero-shot task planning in household envir...

🔹 Publication Date: Published on Dec 18

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.16909
• PDF: https://arxiv.org/pdf/2512.16909
• Project Page: https://hybridrobotics.github.io/MomaGraph/

#VisionLanguageModel #EmbodiedAI #ReinforcementLearning #SceneGraphs #Robotics
Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset

📝 Summary:
This paper introduces IMDD-1M, a large dataset of 1 million industrial defect image-text pairs. It enables training a vision-language foundation model tailored for industrial use. This model achieves comparable performance with less data for specialized tasks, promoting data-efficient quality ins...

🔹 Publication Date: Published on Dec 30, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.24160
• PDF: https://arxiv.org/pdf/2512.24160

#IndustrialAI #VisionLanguageModel #DefectDetection #MultimodalAI #ComputerVision
dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

📝 Summary:
dots.ocr is a unified Vision-Language Model that jointly learns document layout parsing tasks, overcoming limitations of multi-stage pipelines. It achieves state-of-the-art performance on OmniDocBench and sets a new baseline on the challenging multilingual XDocParse benchmark.

🔹 Publication Date: Published on Dec 2, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.02498
• PDF: https://arxiv.org/pdf/2512.02498
• Github: https://github.com/rednote-hilab/dots.ocr

#VisionLanguageModel #DocumentParsing #MultilingualAI #AIResearch #DeepLearning
LightOnOCR: A 1B End-to-End Multilingual Vision-Language Model for State-of-the-Art OCR

📝 Summary:
LightOnOCR-2-1B is a 1B-parameter end-to-end multilingual vision-language model for OCR. It converts document images to text, achieving state-of-the-art results while being smaller and faster. It also features improved image localization and robustness.

🔹 Publication Date: Published on Jan 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.14251
• PDF: https://arxiv.org/pdf/2601.14251

🔹 Models citing this paper:
https://huggingface.co/lightonai/LightOnOCR-1B-1025
https://huggingface.co/lightonai/LightOnOCR-2-1B
https://huggingface.co/lightonai/LightOnOCR-0.9B-32k-1025

🔹 Datasets citing this paper:
https://huggingface.co/datasets/lightonai/LightOnOCR-mix-0126
https://huggingface.co/datasets/lightonai/LightOnOCR-bbox-mix-0126

🔹 Spaces citing this paper:
https://huggingface.co/spaces/lightonai/LightOnOCR-2-1B-Demo
https://huggingface.co/spaces/lightonai/LightOnOCR-1B-Demo
https://huggingface.co/spaces/lightonai/LightOnOCR-1B-Demo-zero

#OCR #VisionLanguageModel #AI #DeepLearning #MultilingualAI
Qianfan-OCR: A Unified End-to-End Model for Document Intelligence

📝 Summary:
Qianfan-OCR is a 4B vision-language model that unifies document parsing, layout analysis, and understanding. It features Layout-as-Thought to improve accuracy on complex layouts and achieves state-of-the-art performance across multiple OCR and document intelligence benchmarks.

🔹 Publication Date: Published on Mar 11

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.13398
• PDF: https://arxiv.org/pdf/2603.13398
• Project Page: https://github.com/baidubce/Qianfan-VL
• Github: https://github.com/baidubce/Qianfan-VL

🔹 Models citing this paper:
https://huggingface.co/baidu/Qianfan-OCR

#OCR #DocumentIntelligence #VisionLanguageModel #AI #MachineLearning
EXAONE 4.5 Technical Report

📝 Summary:
EXAONE 4.5 is LG AI Research's first open-weight vision language model, integrating a visual encoder into EXAONE 4.0. It enhances document understanding and general language capabilities through targeted data and extended context, outperforming similar models in document tasks.

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08644
• PDF: https://arxiv.org/pdf/2604.08644
• Github: https://github.com/LG-AI-EXAONE/EXAONE-4.5

#VisionLanguageModel #AI #DocumentUnderstanding #MultimodalAI #OpenSourceAI
🔥 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

💡 The paper introduces MinerU2.5, a 1.2 billion parameter vision-language model designed for efficient high-resolution document parsing. The model achieves state-of-the-art recognition accuracy while maintaining computational efficiency through a two-stage parsing strategy.

In the first stage, the model performs layout analysis on downsampled images to identify structural elements, reducing computational overhead. In the second stage, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, the authors developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning.

The results demonstrate that MinerU2.5 achieves state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead. Overall, the paper contributes a novel approach to document parsing that balances accuracy and efficiency, making it suitable for a wide range of applications.
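
The two-stage strategy described above can be sketched in a few lines. This is only an illustration under stated assumptions, not MinerU2.5's actual code: the page is a plain 2D list of pixel values, `detect_layout` is a hypothetical stand-in for the stage-1 layout model, and the `Box`, `downsample`, `to_native`, and `parse_page` names are invented for this example. The point it shows is the coordinate handling: boxes are found on the small image, then scaled back so that crops come from the native-resolution page.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned region in pixel coordinates: columns [x0, x1), rows [y0, y1)."""
    label: str
    x0: int
    y0: int
    x1: int
    y1: int


def downsample(image, factor):
    """Keep every `factor`-th pixel in both directions (nearest-neighbour)."""
    return [row[::factor] for row in image[::factor]]


def detect_layout(thumbnail):
    """Hypothetical stand-in for the stage-1 layout model.

    Returns a single box around all non-zero pixels of the thumbnail.
    """
    ys = [y for y, row in enumerate(thumbnail) if any(row)]
    xs = [x for row in thumbnail for x, value in enumerate(row) if value]
    if not ys:
        return []
    return [Box("text", min(xs), min(ys), max(xs) + 1, max(ys) + 1)]


def to_native(box, factor, width, height):
    """Scale a thumbnail box back to native coordinates, clamped to the page."""
    return Box(
        box.label,
        min(box.x0 * factor, width),
        min(box.y0 * factor, height),
        min(box.x1 * factor, width),
        min(box.y1 * factor, height),
    )


def crop(image, box):
    """Cut the boxed region out of the full-resolution page."""
    return [row[box.x0:box.x1] for row in image[box.y0:box.y1]]


def parse_page(image, factor=4):
    """Stage 1 on the downsampled page, stage 2 crops at native resolution."""
    height, width = len(image), len(image[0])
    thumbnail = downsample(image, factor)
    results = []
    for small_box in detect_layout(thumbnail):
        native_box = to_native(small_box, factor, width, height)
        # A real system would pass this crop to a content recognizer.
        results.append((native_box, crop(image, native_box)))
    return results
```

In this sketch the layout pass only ever sees `1 / factor**2` of the pixels, while every crop handed to recognition keeps its original detail.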


📅 Published on Sep 26, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2509.22186
• PDF: https://arxiv.org/pdf/2509.22186
• Project Page: https://opendatalab.github.io/MinerU/
• GitHub: https://github.com/opendatalab/MinerU 61.9k

🤖 Models citing this paper:
https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
https://huggingface.co/opendatalab/MinerU-Diffusion-V1-0320-2.5B
https://huggingface.co/freakynit/MinerU2.5-2509-1.2B

🚀 Spaces citing this paper:
https://huggingface.co/spaces/xiaoye-winters/MinerU-API
https://huggingface.co/spaces/opendatalab/MinerU-Diffusion-V1-0320-2.5B
https://huggingface.co/spaces/Instantnewdesign/document_extract

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DocumentParsing #VisionLanguageModel #HighResolutionImageProcessing #LayoutAnalysis #ContentRecognition
🔥 SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

💡 The paper introduces SmolDocling, a compact vision-language model designed for end-to-end document conversion. The model aims to process entire pages and generate a new universal markup format called DocTags, which captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models or ensemble solutions, SmolDocling offers a single end-to-end conversion model with 256M parameters. This approach allows for accurately capturing content, structure, and spatial location of document elements.

The model is trained to reproduce document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types, including business documents, academic papers, technical reports, patents, and forms. The authors also contribute novel publicly sourced datasets for charts, tables, equations, and code recognition.

Experimental results demonstrate that SmolDocling performs competitively with other vision-language models that are up to 27 times larger, while substantially reducing computational requirements. The model's compact size and robust performance make it a significant contribution to the field of document conversion. The authors plan to make the model and datasets publicly available, which will facilitate further research and development in this area. Overall, SmolDocling offers an efficient and effective solution for end-to-end document conversion, with potential applications in various industries and domains.
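
To put the size comparison in concrete terms, here is a quick back-of-the-envelope calculation. It assumes that "27 times larger" refers to parameter count and that 256M means 256 million; the variable names are illustrative only.

```python
SMOLDOCLING_PARAMS = 256e6   # 256M parameters, as stated in the summary
SIZE_RATIO = 27              # "up to 27 times larger" (assumed to mean parameters)

# Approximate size of the largest model in the comparison.
largest_compared = SMOLDOCLING_PARAMS * SIZE_RATIO
print(f"{largest_compared / 1e9:.1f}B parameters")  # 6.9B parameters
```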


📅 Published on Mar 14, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2503.11576
• PDF: https://arxiv.org/pdf/2503.11576
• Project Page: https://huggingface.co/ds4sd/SmolDocling-256M-preview
• GitHub: https://github.com/docling-project/docling 59.1k

🤖 Models citing this paper:
https://huggingface.co/docling-project/SmolDocling-256M-preview
https://huggingface.co/ibm-granite/granite-docling-258M
https://huggingface.co/docling-project/CodeFormulaV2

📊 Datasets citing this paper:
https://huggingface.co/datasets/mnezama/SynthCodeNet
https://huggingface.co/datasets/docling-project/SynthCodeNet
https://huggingface.co/datasets/HuggingFaceM4/DoclingMatix

🚀 Spaces citing this paper:
https://huggingface.co/spaces/ibm-granite/granite-docling-258m-demo
https://huggingface.co/spaces/ibm-granite/granite-docling-258M-WebGPU
https://huggingface.co/spaces/jairwaal/image

#DocumentConversion #VisionLanguageModel #MultimodalProcessing #EndToEndLearning #DocumentUnderstanding
🔥 RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

💡 The paper introduces RoboMemArena, a comprehensive robotic memory benchmark that addresses the limitations of existing benchmarks by providing a large-scale and diverse set of tasks with real-world evaluation. The benchmark consists of 26 tasks with average trajectory lengths of over 1000 steps per task, and 68.9 percent of subtasks require memory dependence. The tasks are generated using a vision-language model that designs and composes subtasks, generates full trajectories, and provides memory-related annotations.

To tackle the challenges of the RoboMemArena benchmark, the authors propose PrediMem, a dual-system vision-language architecture that improves memory management through predictive coding. PrediMem consists of a high-level vision-language model planner that manages a memory bank with recent and keyframe buffers, and uses a predictive coding head to enhance sensitivity to task dynamics.
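
A memory bank with recent and keyframe buffers, as described above, might look roughly like the following. This is a minimal sketch under assumptions, not PrediMem's implementation: the `MemoryBank` class, its buffer sizes, and the caller-supplied `is_keyframe` flag are all hypothetical, and the summary does not say how keyframes are actually chosen or how the predictive coding head is wired in.

```python
from collections import deque


class MemoryBank:
    """Hypothetical two-buffer memory: a sliding window of recent
    observations plus a capped buffer of keyframes."""

    def __init__(self, recent_size=4, keyframe_size=8):
        # deque(maxlen=...) silently drops the oldest entry when full.
        self.recent = deque(maxlen=recent_size)
        self.keyframes = deque(maxlen=keyframe_size)

    def add(self, step, observation, is_keyframe=False):
        """Store an observation; the keyframe decision is left to the caller."""
        entry = (step, observation)
        self.recent.append(entry)
        if is_keyframe:
            self.keyframes.append(entry)

    def context(self):
        """Older keyframes followed by the recent window, without duplicates."""
        recent_steps = {step for step, _ in self.recent}
        older = [e for e in self.keyframes if e[0] not in recent_steps]
        return older + list(self.recent)


bank = MemoryBank(recent_size=2, keyframe_size=2)
for step in range(6):
    bank.add(step, f"obs{step}", is_keyframe=step in (0, 3, 5))

print(bank.context())  # [(3, 'obs3'), (4, 'obs4'), (5, 'obs5')]
```

The design choice illustrated here is that the planner's context stays bounded however long the trajectory runs, while selected older observations survive beyond the recent window.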

The authors evaluate PrediMem on the RoboMemArena benchmark and demonstrate that it outperforms all baseline models. The results provide insights into memory management, model architecture, and scaling laws for complex memory systems. The paper contributes to the development of robotic intelligence by providing a comprehensive benchmark and a state-of-the-art model that can effectively manage memory in partially observable environments.

The key contributions of the paper are the introduction of the RoboMemArena benchmark, which provides a challenging and diverse set of tasks for evaluating robotic memory, and the proposal of the PrediMem model, which demonstrates improved memory management through predictive coding. The paper also provides a thorough evaluation of the PrediMem model on the RoboMemArena benchmark, highlighting its effectiveness in managing memory in complex tasks. Overall, the paper advances the state-of-the-art in robotic memory and provides a foundation for future research in this area.


📅 Published on May 11

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.10921
• PDF: https://arxiv.org/pdf/2605.10921
• Project Page: https://robomemarena.github.io/
• GitHub: https://github.com/OpenHelix-Team/RoboMemArena 43

#RoboticMemoryBenchmark #VisionLanguageModel #RoboticsAndMemory #ArtificialIntelligenceBenchmarking #RoboMemArena