✨SEM: Sparse Embedding Modulation for Post-Hoc Debiasing of Vision-Language Models
📝 Summary:
Sparse Embedding Modulation (SEM) debiases vision-language models by operating in a sparse autoencoder latent space. SEM precisely modulates bias-relevant neurons while preserving semantic information, achieving substantial fairness gains in retrieval and classification tasks.
🔹 Publication Date: Published on Mar 19
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.19028
• PDF: https://arxiv.org/pdf/2603.19028
• Project Page: https://sparse-embedding-modulation.github.io/
• Github: https://github.com/mardgui/SEM
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#VisionLanguageModels #BiasCorrection #MachineLearning #AIResearch #DeepLearning
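To make the SEM summary above more concrete, here is a minimal sketch of modulating bias-relevant latents in a sparse autoencoder. It assumes a toy autoencoder with hand-picked weights and a hypothetical set of bias-relevant latent indices; the function names, the weights, and the simple rescaling rule are illustrative assumptions, not the SEM repository's API or the paper's exact procedure.

```python
# Toy sketch: modulate bias-relevant latents of a sparse autoencoder (SAE).
# Weights, latent indices, and function names are hypothetical, not from the SEM code.

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]

def sae_encode(encoder, embedding):
    """Linear map followed by ReLU, which yields a sparse latent code."""
    return [max(0.0, a) for a in matvec(encoder, embedding)]

def sae_decode(decoder, latents):
    """Linear map from the latent code back to embedding space."""
    return matvec(decoder, latents)

def modulate(latents, bias_idx, scale=0.0):
    """Rescale only the latents flagged as bias-relevant; leave the rest untouched."""
    return [z * scale if i in bias_idx else z for i, z in enumerate(latents)]

# 2-D embedding, 4 latents: each latent fires for one signed axis direction.
ENCODER = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
DECODER = [[1.0, -1.0, 0.0, 0.0], [0.0, 0.0, 1.0, -1.0]]

embedding = [0.8, -0.5]
latents = sae_encode(ENCODER, embedding)
reconstructed = sae_decode(DECODER, latents)

# Assumption for illustration: latents 2 and 3 (the second axis) carry the bias signal.
debiased = sae_decode(DECODER, modulate(latents, bias_idx={2, 3}))
```

In this toy setup the first embedding coordinate passes through unchanged while the second is suppressed, which mirrors the stated goal of changing bias-relevant neurons while preserving the remaining semantic content.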
✨VFIG: Vectorizing Complex Figures in SVG with Vision-Language Models
📝 Summary:
VFIG is a vision-language model that converts raster images into Scalable Vector Graphics (SVG). It employs a 66K dataset and hierarchical training for high-fidelity conversion, outperforming open-source models and matching proprietary ones.
🔹 Publication Date: Published on Mar 25
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.24575
• PDF: https://arxiv.org/pdf/2603.24575
• Project Page: https://vfig-proj.github.io/
• Github: https://github.com/RAIVNLab/VFig
🔹 Models citing this paper:
• https://huggingface.co/XunmeiLiu/VFIG-4B
✨ Spaces citing this paper:
• https://huggingface.co/spaces/allenai/VFig-Image2SVG-Demo
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#VisionLanguageModels #SVG #VectorGraphics #AI #ComputerVision
✨Know3D: Prompting 3D Generation with Knowledge from Vision-Language Models
📝 Summary:
Know3D integrates vision-language models into 3D generation via latent hidden-state injection. This enables language-controlled synthesis of unseen back-views, transforming stochastic hallucination into a semantically guided process for 3D assets.
🔹 Publication Date: Published on Mar 24
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.22782
• PDF: https://arxiv.org/pdf/2603.22782
• Project Page: https://xishuxishu.github.io/Know3D.github.io/
• Github: https://github.com/xishuxishu/Know3D
✨ Spaces citing this paper:
• https://huggingface.co/spaces/xishushu/Know3D
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#3DGeneration #VisionLanguageModels #GenerativeAI #DeepLearning #AIResearch
✨A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI
📝 Summary:
This paper finds that even state-of-the-art multi-billion parameter AI models struggle with surgical tool detection, a seemingly simple task. Scaling models further offers diminishing returns, suggesting fundamental limitations for current Vision Language Models in surgical use cases beyond just ...
🔹 Publication Date: Published on Mar 28
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.27341
• PDF: https://arxiv.org/pdf/2603.27341
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#SurgicalAI #MedicalAI #FoundationModels #VisionLanguageModels #AIHealthcare
✨LinguDistill: Recovering Linguistic Ability in Vision-Language Models via Selective Cross-Modal Distillation
📝 Summary:
LinguDistill enables recovery of linguistic capabilities in vision-language models through adapter-free distillation using frozen language models as teachers, achieving performance close to pre-adapta...
🔹 Publication Date: Published on Apr 1
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.00829
• PDF: https://arxiv.org/pdf/2604.00829
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#VisionLanguageModels #NLP #ModelDistillation #ArtificialIntelligence #MachineLearning
✨Vero: An Open RL Recipe for General Visual Reasoning
📝 Summary:
Vero is an open vision-language model family that achieves state-of-the-art visual reasoning performance through scaled reinforcement learning data across diverse tasks, demonstrating that broad data ...
🔹 Publication Date: Published on Apr 6
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.04917
• PDF: https://arxiv.org/pdf/2604.04917
• Project Page: https://vero-reasoning.github.io/
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#VisualReasoning #ReinforcementLearning #VisionLanguageModels #AIResearch #DeepLearning
✨VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning
📝 Summary:
VRAG-RL introduces a reinforcement learning framework to empower vision-language models for understanding visually rich information. It uses adaptive visual perception and query optimization to enhance retrieval and reasoning, overcoming limitations of current RAG methods.
🔹 Publication Date: Published on May 28, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2505.22019
• PDF: https://arxiv.org/pdf/2505.22019
• Github: https://github.com/Alibaba-NLP/VRAG
🔹 Models citing this paper:
• https://huggingface.co/Qiuchen-Wang/Qwen2.5-VL-7B-VRAG
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#RAG #ReinforcementLearning #VisionLanguageModels #ComputerVision #AI
✨CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation
📝 Summary:
CT-1 is a Vision-Language-Camera model that improves camera-controllable video generation. It uses a Diffusion Transformer and Wavelet Regularization Loss to accurately estimate camera trajectories, enabling precise video synthesis. This achieves 25.7% better accuracy than prior methods.
🔹 Publication Date: Published on Apr 10
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.09201
• PDF: https://arxiv.org/pdf/2604.09201
• Project Page: https://gulucaptain.github.io/Camera-Transformer-1/
• Github: https://github.com/gulucaptain/Camera-Transformer-1
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#AI #VideoGeneration #ComputerVision #DiffusionModels #VisionLanguageModels
✨SketchVLM: Vision language models can annotate images to explain thoughts and guide users
📝 Summary:
SketchVLM is a training-free framework that enables vision-language models to generate editable SVG overlays for visual explanations, improving reasoning accuracy and annotation quality across multipl...
🔹 Publication Date: Published on Apr 23
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.22875
• PDF: https://arxiv.org/pdf/2604.22875
• Project Page: https://sketchvlm.github.io/
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#SketchVLM #VisionLanguageModels #ComputerVision #AI #ImageAnnotation
AI & ML Papers
🔥 PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model
📅 Published on Oct 16, 2025
🔗 Links:
• arXiv: https://arxiv.org/abs/2510.14528
• PDF: https://arxiv.org/pdf/2510.14528
• GitHub: https://github.com/PaddlePaddle/PaddleOCR ⭐ 77.1k
🤖 Models citing this paper:
• https://huggingface.co/PaddlePaddle/PaddleOCR-VL
• https://huggingface.co/PaddlePaddle/PP-DocLayoutV2
• https://huggingface.co/unsloth/PaddleOCR-VL
📊 Datasets citing this paper:
• https://huggingface.co/datasets/proxectonos/corpus_dominio_cientifico
🚀 Spaces citing this paper:
• https://huggingface.co/spaces/PaddlePaddle/PaddleOCR-VL_Online_Demo
• https://huggingface.co/spaces/eduagarcia/multilingual-tokenizer-leaderboard
• https://huggingface.co/spaces/waytoAGI/PaddleOCR-VL_Online_Demo
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#MultilingualDocumentParsing #VisionLanguageModels #DocumentAnalysis #TableRecognition #MultimodalLearning
💡 The paper proposes PaddleOCR-VL, a state-of-the-art, resource-efficient model for document parsing. It targets accurate recognition of document elements such as text, tables, formulas, and charts while keeping resource consumption low. The underlying vision-language model, PaddleOCR-VL-0.9B, combines a NaViT-style dynamic-resolution visual encoder with the ERNIE language model. It is compact yet supports 109 languages and recognizes complex elements with high accuracy. PaddleOCR-VL achieves state-of-the-art performance in both page-level document parsing and element-level recognition, outperforming existing solutions and remaining competitive with top-tier vision-language models. It also delivers fast inference, which makes it well suited to practical deployment. The code is available for further research and development.
🔥 MolmoAct2: Action Reasoning Models for Real-world Deployment
📅 Published on May 4
🔗 Links:
• arXiv: https://arxiv.org/abs/2605.02881
• PDF: https://arxiv.org/pdf/2605.02881
• Project Page: https://allenai.org/blog/molmoact2
• GitHub: https://github.com/allenai/molmoact2 ⭐ 90
🤖 Models citing this paper:
• https://huggingface.co/allenai/MolmoAct2
• https://huggingface.co/allenai/MolmoAct2-SO100_101
• https://huggingface.co/allenai/Molmo2-ER
📊 Datasets citing this paper:
• https://huggingface.co/datasets/allenai/13122025-tool-04
• https://huggingface.co/datasets/allenai/13122025-cut-10
• https://huggingface.co/datasets/allenai/28112025-yam-01
🚀 Spaces citing this paper:
• https://huggingface.co/spaces/allenai/dataset-stats
• https://huggingface.co/spaces/allenai/lerobot-visualizer-v3
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#RoboticsActionReasoning #VisionLanguageModels #EmbodiedAI #BimanualRobotics #SpatialReasoning
💡 The paper presents MolmoAct2, an open action reasoning model for robotics. Vision-language-action models aim to provide a single generalist controller for robots, but current systems are closed, require expensive hardware, or have high latency. MolmoAct2 addresses these issues with several new components. Its backbone, MolmoER, is a specialized vision-language model trained on a large corpus and designed for spatial and embodied reasoning. The release also includes three new datasets, among them the largest open bimanual dataset to date, and an open-weight action tokenizer called OpenFAST. The redesigned architecture adds a continuous-action expert and an adaptive-depth reasoning variant called MolmoThink, which reduces latency by re-predicting depth tokens only for scene regions that change between timesteps. MolmoAct2 outperforms strong baselines on several simulation and real-world benchmarks, and the model weights, training code, and training data are released.
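The adaptive-depth idea described above (re-predicting depth tokens only where the scene changed) can be sketched as a simple cache update. This is a minimal illustration under stated assumptions: each scene region is reduced to a single number, change is detected with a hypothetical threshold on that number, and `predict_depth` is a stand-in for whatever the model actually does. None of these names or choices come from the MolmoAct2 code.

```python
# Toy sketch: reuse cached depth tokens and re-predict only for changed regions.
# The change test, threshold, and predictor are hypothetical placeholders.

def changed_regions(prev_patches, curr_patches, threshold=0.1):
    """Return indices of regions whose summary value moved by more than the threshold."""
    return [
        i
        for i, (prev, curr) in enumerate(zip(prev_patches, curr_patches))
        if abs(prev - curr) > threshold
    ]

def update_depth_tokens(cached_tokens, prev_patches, curr_patches, predict_depth, threshold=0.1):
    """Copy the cache, then overwrite only the tokens of regions that changed."""
    tokens = list(cached_tokens)
    changed = changed_regions(prev_patches, curr_patches, threshold)
    for i in changed:
        tokens[i] = predict_depth(curr_patches[i])
    return tokens, changed

def toy_predictor(patch_value):
    """Stand-in for an expensive depth prediction call."""
    return round(patch_value * 10)

prev_frame = [0.2, 0.5, 0.9, 0.4]
curr_frame = [0.2, 0.8, 0.9, 0.4]   # only region 1 changed
cached = [2, 5, 9, 4]

tokens, changed = update_depth_tokens(cached, prev_frame, curr_frame, toy_predictor)
```

Here only one of the four regions triggers a new prediction, so the expensive call runs once instead of four times; the latency saving in a real system would depend on how change is detected, which the summary does not specify.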
🔥 CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models
📅 Published on May 11
🔗 Links:
• arXiv: https://arxiv.org/abs/2605.10903
• PDF: https://arxiv.org/pdf/2605.10903
• Project Page: https://capvector.github.io/
• GitHub: https://github.com/OpenHelix-Team/CapVector ⭐ 26
🤖 Models citing this paper:
• https://huggingface.co/haofuly/capvector_models_collection
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VisionLanguageModels #ParametricSpaceLearning #TransferableCapabilities #VisionLanguageAction #MultimodalLearning
💡 This paper proposes a novel approach called CapVector to improve the performance of vision-language-action models. The problem addressed is that pre-trained models often fail to improve performance and reduce adaptation costs during standard supervised finetuning. Advanced finetuning methods with auxiliary training objectives can improve performance but incur significant computational overhead.
The proposed method decouples the auxiliary training objectives from standard supervised finetuning to enhance model capabilities while reducing computational overhead. The model is first trained to convergence on a small-scale task set with two distinct training strategies, which yields two finetuned models. The parameter difference between the two models is interpreted as the capability vectors contributed by the auxiliary objectives. These vectors are then merged with the pre-trained parameters to form a capability-enhanced meta model.
The method also uses a lightweight orthogonal regularization loss to augment standard supervised finetuning, which reduces computational overhead. The results show that the capability vectors are effective and versatile across diverse models, and can generalize to novel environments and embodiments without additional training. The proposed approach achieves performance comparable to auxiliary finetuned baselines with reduced computational overhead, making it a promising solution for improving vision-language-action models.
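The parameter arithmetic described above can be sketched in a few lines. This is a minimal illustration, assuming parameters stored as plain dictionaries of float lists, a capability vector taken as (auxiliary-objective finetune) minus (standard finetune), and merging by scaled addition with a hypothetical coefficient `alpha`. The function names and the merge rule are assumptions for illustration; the summary does not state how CapVector performs the merge.

```python
# Toy sketch: capability vector = difference of two finetuned parameter sets,
# merged into the pre-trained parameters. Names and the merge rule are hypothetical.

def capability_vector(theta_aux, theta_std):
    """Per-parameter difference between the auxiliary and the standard finetune."""
    return {
        name: [a - s for a, s in zip(theta_aux[name], theta_std[name])]
        for name in theta_aux
    }

def merge(theta_pre, vector, alpha=1.0):
    """Add the (optionally scaled) capability vector to the pre-trained parameters."""
    return {
        name: [p + alpha * d for p, d in zip(theta_pre[name], vector[name])]
        for name in theta_pre
    }

theta_pre = {"w": [1.0, 2.0]}   # pre-trained parameters
theta_std = {"w": [1.5, 2.0]}   # finetuned with standard supervised finetuning
theta_aux = {"w": [2.0, 1.5]}   # finetuned with auxiliary objectives

vector = capability_vector(theta_aux, theta_std)
meta_model = merge(theta_pre, vector)
```

Because the vector is computed once on a small task set, the merged meta model can then be finetuned with the standard objective alone, which is where the claimed reduction in overhead would come from.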
🔥 SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture
📅 Published on May 12
🔗 Links:
• arXiv: https://arxiv.org/abs/2605.12500
• PDF: https://arxiv.org/pdf/2605.12500
• GitHub: https://github.com/OpenSenseNova/SenseNova-U1 ⭐ 1.6k
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#MultimodalUnderstanding #NEOunifyArchitecture #VisionLanguageModels #MultimodalGeneration #UnifiedIntelligenceModels
💡 The paper introduces SenseNova-U1, a unified multimodal model that integrates understanding and generation into a single process, overcoming the traditional divide between these two tasks. Current large vision-language models treat understanding and generation as separate problems, leading to fragmented architectures and misaligned representation spaces. The authors argue that this divide hinders the emergence of native multimodal intelligence and propose a new paradigm, NEO-unify, which views understanding and generation as synergistic aspects of a single process.
The authors present two variants of SenseNova-U1, built on dense and mixture-of-experts understanding baselines, and demonstrate their performance across various tasks, including text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. The models also excel in image synthesis, infographic generation, and interleaved vision-language generation, showing strong semantic consistency and visual fidelity.
The paper provides detailed information on model design, data preprocessing, pre- and post-training, and inference strategies, supporting community research. The results show that SenseNova-U1 models perform strongly in vision-language-action and world model scenarios, indicating a broader roadmap where models can think and act across modalities in a native manner. The authors conclude that multimodal AI should focus on building a unified system, rather than connecting separate systems, allowing necessary capabilities to emerge from within. Overall, the paper contributes to the development of unified multimodal models that can integrate understanding and generation, paving the way for more advanced and native multimodal intelligence.
🔥 dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
📅 Published on Dec 2, 2025
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2512.02498
• PDF: https://arxiv.org/pdf/2512.02498
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#DocumentLayoutParsing #VisionLanguageModels #MultilingualOCR #RelationalUnderstanding #EndToEndLearning
💡 The paper introduces dots.ocr, a unified Vision-Language Model that achieves state-of-the-art performance on document layout parsing by jointly learning layout detection, text recognition, and relational understanding. Current methods rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. dots.ocr instead learns the three core tasks within a single end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, enabling robust performance across a wide array of tasks, languages, layouts, and domains. The model is validated on OmniDocBench and on XDocParse, a challenging new benchmark introduced in the paper that spans 126 languages. dots.ocr establishes a strong new baseline, outperforming the next-best competitor by a 7.4-point margin. The paper's contributions are the unified model, the new benchmark for multilingual document intelligence, and evidence for the advantages of joint learning within a single model.
🔥 Unlocking Dense Metric Depth Estimation in VLMs
📅 Published on May 15
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15876
• PDF: https://arxiv.org/pdf/2605.15876
• Project Page: https://depthvlm.github.io/
🤖 Models citing this paper:
• https://huggingface.co/JonnyYu828/DepthVLM-4B
📊 Datasets citing this paper:
• https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VisionLanguageModels #DenseMetricDepthEstimation #DepthEstimationInVLMs #GeometryPrediction #VisionTextSupervision
💡 The paper proposes DepthVLM, a framework that enhances Vision-Language Models with dense geometry prediction capabilities. Vision-Language Models are limited in 3D understanding due to their text-only supervision paradigm, which prevents the recovery of dense geometry. Prior methods have limitations such as error accumulation or inefficient prediction. DepthVLM addresses this by attaching a lightweight depth head to the model backbone and training it under a unified vision-text supervision paradigm with a two-stage schedule. This allows the model to generate full-resolution depth maps alongside language outputs in a single forward pass. The authors also introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. The results show that DepthVLM significantly outperforms existing Vision-Language Models, surpasses leading pure vision models, and improves complex 3D spatial reasoning, making it a step toward a truly unified foundation model. The code and checkpoints will be publicly released, making it accessible for further research and development. Overall, DepthVLM provides a simple yet effective solution for dense metric depth estimation in Vision-Language Models, unlocking their potential for 3D understanding and spatial reasoning.
🔥 More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
📅 Published on Sep 30, 2025
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2509.25848
• PDF: https://arxiv.org/pdf/2509.25848
• Project Page: https://xytian1008.github.io/VAPO/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VisionLanguageModels #MultimodalReasoning #VisualForgetting #VisionAnchoredPolicyOptimization #PerceptualGrounding
💡 This paper explores reasoning in Vision Language Models and identifies a dual nature of multimodal reasoning. While reasoning enhances logical inference and improves performance on complex tasks, it can also impair perceptual grounding, leading to recognition failures on basic visual questions. The authors attribute this phenomenon to visual forgetting, where prolonged reasoning causes the model to disregard visual input. To address this issue, they propose Vision Anchored Policy Optimization, a method that steers the reasoning process toward visually grounded trajectories. The resulting model, VAPO Thinker 7B, relies significantly more on visual information and achieves state-of-the-art results on a range of benchmarks. The key contributions are the identification of this dual nature and a method that balances reasoning with visual grounding, leading to improved performance on visual tasks.
🔥 LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence
📅 Published on May 25
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.25979
• PDF: https://arxiv.org/pdf/2605.25979
• Project Page: https://evolvinglmms-lab.github.io/LLaVA-OneVision-2/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#MultimodalLearning #VisionLanguageModels #VideoContentUnderstanding #PerceptualIntelligence #CodecStreamTokenization
💡 The paper introduces LLaVA-OneVision-2, a vision-language model that achieves superior performance across various multimodal benchmarks. The problem addressed is the need for a more capable model that can efficiently process and understand video content. The method used to achieve this is codec-stream tokenization, which treats compressed video as a continuous bit-cost stream and allocates a limited token budget to event-bearing content. This approach enables more stable long-video token compression than fixed groups of pictures. The model also incorporates windowed attention for efficient local computation and a shared 3D RoPE to place codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system.
The model was trained with large-scale open supervision: approximately 8 million re-captioned video samples for pretraining and a 4-million-sample spatial corpus for fine-tuning. The paper also introduces JumpScore, a temporal-localization benchmark that targets fine-grained grounding in high-frequency, densely repeated motion. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 mAP, surpassing Qwen3-VL-8B by 44.8 points. It also outperforms Qwen3-VL-8B by 4.3 average points on video tasks, 5.3 on spatial tasks, and 15.6 average J&F on tracking tasks.
The key contributions are codec-stream tokenization, windowed attention, and large-scale open supervision. The paper also emphasizes unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning.
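The budget-allocation idea behind codec-stream tokenization can be sketched roughly as follows. This is only an illustration, assuming per-segment bit costs are already available from the codec. The function name `allocate_tokens`, the proportional rule, and the largest-remainder rounding are hypothetical choices, not the paper's actual method.

```python
def allocate_tokens(bit_costs, budget):
    """Split a fixed token budget across segments in proportion to bit cost.

    Hypothetical illustration: segments that cost more bits to encode
    (a rough proxy for motion or new events) receive more tokens.
    """
    total = sum(bit_costs)
    if total == 0 or budget <= 0:
        return [0] * len(bit_costs)
    shares = [budget * cost / total for cost in bit_costs]
    tokens = [int(share) for share in shares]
    # Hand out the tokens lost to flooring, largest fractional part first.
    leftover = budget - sum(tokens)
    order = sorted(
        range(len(shares)),
        key=lambda i: shares[i] - tokens[i],
        reverse=True,
    )
    for i in order[:leftover]:
        tokens[i] += 1
    return tokens


# Static segments (low bit cost) get few tokens, busy segments get many.
print(allocate_tokens([100, 900, 0, 1000], 20))
```

A proportional split is the simplest possible rule; a real system would likely also enforce per-segment minimums and maximums.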
🔥 LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
📅 Published on May 28
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30265
• PDF: https://arxiv.org/pdf/2605.30265
• Project Page: https://maplebb.github.io/LoMo/page/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VisionLanguageModels #ModalitySubstitution #CrossModalLearning #MultimodalFusion #DeepLearningArchitectures
💡 The paper addresses modality sensitivity in vision-language models: performance degrades significantly when the modality of the input changes, for example when a textual question is replaced with its rendered-image counterpart. The authors trace this to a bias in current training corpora, where text and images are typically organized into distinct and asymmetric roles. They propose Local Modality Substitution (LoMo), a data curation approach that provides supervision for cross-modal representational invariance between semantically equivalent text and image carriers. The method reformulates single-modality prompts into interleaved multimodal sequences by dynamically selecting target text spans and recasting them as rendered images, preserving the same semantics across carriers. Evaluated on 13 diverse multimodal benchmarks, the approach improves overall multimodal reasoning and yields deeper cross-modal fusion, with consistent gains across foundation models: 2.67 points on one model and 2.82 points on another, compared to standard methods. The method is lightweight and architecture-agnostic.
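The span-substitution step can be pictured with a small sketch. It assumes word-level spans and represents a rendered image as an `("image", text)` placeholder instead of producing real pixels. The function `substitute_span` and the segment format are hypothetical, not taken from the paper.

```python
def substitute_span(prompt, start, length):
    """Turn a text prompt into an interleaved text/image segment list.

    Hypothetical illustration: words[start:start+length] become an
    ("image", text) entry, standing in for a rendered image of that text.
    """
    words = prompt.split()
    end = min(start + length, len(words))
    segments = []
    if start > 0:
        segments.append(("text", " ".join(words[:start])))
    if end > start:
        segments.append(("image", " ".join(words[start:end])))
    if end < len(words):
        segments.append(("text", " ".join(words[end:])))
    return segments


segments = substitute_span("What is the capital of France ?", 3, 3)
print(segments)
# The carried semantics are unchanged: joining all segments restores the prompt.
print(" ".join(text for _, text in segments))
```

In a real pipeline the `"image"` entries would be rendered to pixels, and the span would be chosen dynamically rather than passed in by the caller.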
🔥 VLM3: Vision Language Models Are Native 3D Learners
📅 Published on May 28
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30561
• PDF: https://arxiv.org/pdf/2605.30561
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VisionLanguageModels #3DUnderstanding #DepthEstimation #ObjectLevel3D #ComputerVisionModels
💡 The paper "VLM3: Vision Language Models Are Native 3D Learners" challenges the common approach to 3D understanding tasks in computer vision. Typically, these tasks rely on specialized vision models with complex designs and extensive data augmentation. The authors argue instead that vision language models can be adapted for 3D understanding through simple architectural modifications and text-based training.
The problem addressed in this paper is that 3D understanding tasks such as depth estimation and object-level 3D understanding are currently dominated by expert vision models that have complex task-specific designs. The authors propose that vision language models can be native 3D learners and achieve comparable performance to these specialized models.
The method makes three simple modifications to standard vision language models: focal length unification, text-based pixel reference, and data mixture and scaling. The result is VLM3, a scalable method that enables standard vision language models to master diverse 3D tasks without complex designs or extensive data augmentation.
The results of the study show that VLM3 advances the depth estimation accuracy of vision language models by a large margin, from 0.84 to 0.9. Additionally, VLM3 enables diverse 3D tasks such as pixel correspondence, camera pose estimation, and object-level 3D understanding, matching the accuracy of expert vision models while maintaining standard architectures and text-based training. Overall, the paper presents a new paradigm for simple and scalable 3D learning, demonstrating that vision language models can be effective native 3D learners.
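The summary names the modifications without detailing them, so the sketch below shows only one plausible reading of the first two. It assumes focal length unification means rescaling an image so its focal length in pixels matches a canonical value, and that a text-based pixel reference means writing a pixel location as quantized coordinates in plain text. The function names, the canonical focal length of 1000, and the 1000-bin grid are all hypothetical and may differ from what the paper does.

```python
def unify_focal_length(width, height, focal_px, canonical_focal_px=1000.0):
    """Return the resized (width, height) and scale that map focal_px to the canonical value.

    Hypothetical illustration: resizing an image by `scale` multiplies its
    focal length in pixels by the same factor.
    """
    scale = canonical_focal_px / focal_px
    return round(width * scale), round(height * scale), scale


def pixel_to_text(x, y, width, height, bins=1000):
    """Write a pixel location as text using resolution-independent integer bins."""
    bx = min(int(x / width * bins), bins - 1)
    by = min(int(y / height * bins), bins - 1)
    return f"({bx}, {by})"


print(unify_focal_length(640, 480, 500.0))
print(pixel_to_text(320, 240, 640, 480))
```

Quantizing to a fixed grid keeps the text reference independent of image resolution, which is why the same string can survive the resize step.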
🔥 SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning
📅 Published on Jun 11
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.13673
• PDF: https://arxiv.org/pdf/2606.13673
• Project Page: https://spatialclaw.github.io/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#SpatialReasoning #VisionLanguageModels #AgenticInterfaces #SpatialArtificialIntelligence #CodeBasedActionInterfaces
💡 The paper introduces SpatialClaw, a training-free framework that enables flexible and stateful spatial reasoning in vision-language models. The problem addressed is the limitation of current spatial agents in performing open-ended spatial reasoning tasks, which is due to the design of the action interface that invokes specialist perception modules. Existing spatial agents use either single-pass code execution or a structured tool-call interface, both of which offer limited flexibility for complex 3D/4D spatial reasoning.
The proposed SpatialClaw framework uses code as the action interface, allowing a vision-language model-backed agent to write executable code conditioned on prior outputs. This approach enables the agent to flexibly compose and manipulate perception results and adapt its analysis to intermediate text and visual observations. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives.
The results show that SpatialClaw achieves superior performance across diverse 3D/4D spatial reasoning tasks, with an average accuracy of 59.9% across 20 benchmarks. This represents a significant improvement of 11.2 points over the recent spatial agent, with consistent gains across six vision-language model backbones from two model families, without any benchmark- or model-specific adaptation. The paper's contribution is the introduction of a flexible and effective framework for spatial reasoning that can be applied to a wide range of tasks without requiring training or adaptation.
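The stateful-kernel idea can be illustrated with a toy sketch: variables created by one code step stay available to the next, so later code can build on earlier results. The class `StatefulKernel` and its `run` method are hypothetical stand-ins, not SpatialClaw's actual interface, and plain `exec` offers no sandboxing.

```python
import contextlib
import io


class StatefulKernel:
    """Toy stand-in for a persistent Python kernel: state survives across steps."""

    def __init__(self, preloaded=None):
        # Names preloaded here play the role of input frames and primitives.
        self.namespace = dict(preloaded or {})

    def run(self, code):
        """Execute code in the shared namespace and return what it printed."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)
        return buffer.getvalue()


kernel = StatefulKernel({"frames": ["frame0", "frame1", "frame2"]})
kernel.run("n = len(frames)")          # step 1 defines n
observation = kernel.run("print(n * 2)")  # step 2 reuses n from step 1
print(observation)
```

In an agent loop, each returned observation would be fed back to the model, which then writes the next code step conditioned on it.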