AI & ML Papers – Telegram

AI & ML Papers

33.4K subscribers

7.17K photos

556 videos

24 files

7.87K links

Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho

✨VRAG-RL: Empower Vision-Perception-Based RAG for Visually Rich Information Understanding via Iterative Reasoning with Reinforcement Learning

📝 Summary:
VRAG-RL introduces a reinforcement learning framework to empower vision-language models for understanding visually rich information. It uses adaptive visual perception and query optimization to enhance retrieval and reasoning, overcoming limitations of current RAG methods.

🔹 Publication Date: May 28, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2505.22019
• PDF: https://arxiv.org/pdf/2505.22019
• Github: https://github.com/Alibaba-NLP/VRAG

🔹 Models citing this paper:
• https://huggingface.co/Qiuchen-Wang/Qwen2.5-VL-7B-VRAG

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#RAG #ReinforcementLearning #VisionLanguageModels #ComputerVision #AI

250 views · 09:56

✨CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

📝 Summary:
CT-1 is a Vision-Language-Camera model that improves camera-controllable video generation. It uses a Diffusion Transformer and Wavelet Regularization Loss to accurately estimate camera trajectories, enabling precise video synthesis. This achieves 25.7% better accuracy than prior methods.

🔹 Publication Date: Apr 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.09201
• PDF: https://arxiv.org/pdf/2604.09201
• Project Page: https://gulucaptain.github.io/Camera-Transformer-1/
• Github: https://github.com/gulucaptain/Camera-Transformer-1

#AI #VideoGeneration #ComputerVision #DiffusionModels #VisionLanguageModels

206 views · 02:01

✨SketchVLM: Vision language models can annotate images to explain thoughts and guide users

📝 Summary:
SketchVLM is a training-free framework that enables vision-language models to generate editable SVG overlays for visual explanations, improving reasoning accuracy and annotation quality across multipl...

🔹 Publication Date: Apr 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.22875
• PDF: https://arxiv.org/pdf/2604.22875
• Project Page: https://sketchvlm.github.io/

#SketchVLM #VisionLanguageModels #ComputerVision #AI #ImageAnnotation

219 views · 03:01

🔥 PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

💡 The paper proposes PaddleOCR-VL, a state-of-the-art, resource-efficient model for document parsing. It targets the need to accurately recognize document elements such as text, tables, formulas, and charts while keeping resource consumption low. The model, PaddleOCR-VL-0.9B, is a compact vision-language model that combines a NaViT-style dynamic-resolution visual encoder with the ERNIE language model, and it supports 109 languages. PaddleOCR-VL achieves state-of-the-art performance in both page-level document parsing and element-level recognition, outperforming existing solutions and remaining competitive with top-tier vision-language models. It also delivers fast inference, which makes it suitable for practical deployment, and the code is available.

📅 Published on Oct 16, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2510.14528
• PDF: https://arxiv.org/pdf/2510.14528
• GitHub: https://github.com/PaddlePaddle/PaddleOCR ⭐ 77.1k

🤖 Models citing this paper:
• https://huggingface.co/PaddlePaddle/PaddleOCR-VL
• https://huggingface.co/PaddlePaddle/PP-DocLayoutV2
• https://huggingface.co/unsloth/PaddleOCR-VL

📊 Datasets citing this paper:
• https://huggingface.co/datasets/proxectonos/corpus_dominio_cientifico

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/PaddlePaddle/PaddleOCR-VL_Online_Demo
• https://huggingface.co/spaces/eduagarcia/multilingual-tokenizer-leaderboard
• https://huggingface.co/spaces/waytoAGI/PaddleOCR-VL_Online_Demo

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultilingualDocumentParsing #VisionLanguageModels #DocumentAnalysis #TableRecognition #MultimodalLearning

❤3

466 views · 04:56

🔥 MolmoAct2: Action Reasoning Models for Real-world Deployment

💡 The paper presents MolmoAct2, an open action reasoning model for robotics. Vision-language-action models aim to provide a single generalist controller for robots, but current systems are often closed, require expensive hardware, or have high latency. MolmoAct2 addresses these issues with several new components. Its backbone, MolmoER, is a specialized vision-language model trained on a large corpus of data and designed for spatial and embodied reasoning. The release also includes three new datasets, among them the largest open bimanual dataset to date, and an open-weight action tokenizer called OpenFAST. The redesigned architecture adds a continuous-action expert and an adaptive-depth reasoning variant, MolmoThink, which reduces latency by re-predicting depth tokens only for scene regions that change between timesteps. MolmoAct2 outperforms strong baselines on several simulation and real-world benchmarks, and the model weights, training code, and training data are released.

📅 Published on May 4

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.02881
• PDF: https://arxiv.org/pdf/2605.02881
• Project Page: https://allenai.org/blog/molmoact2
• GitHub: https://github.com/allenai/molmoact2 ⭐ 90

🤖 Models citing this paper:
• https://huggingface.co/allenai/MolmoAct2
• https://huggingface.co/allenai/MolmoAct2-SO100_101
• https://huggingface.co/allenai/Molmo2-ER

📊 Datasets citing this paper:
• https://huggingface.co/datasets/allenai/13122025-tool-04
• https://huggingface.co/datasets/allenai/13122025-cut-10
• https://huggingface.co/datasets/allenai/28112025-yam-01

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/allenai/dataset-stats
• https://huggingface.co/spaces/allenai/lerobot-visualizer-v3

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#RoboticsActionReasoning #VisionLanguageModels #EmbodiedAI #BimanualRobotics #SpatialReasoning

283 views · 04:59

🔥 CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

💡 This paper proposes a novel approach called CapVector to improve the performance of vision-language-action models. The problem addressed is that pre-trained models often fail to improve performance and reduce adaptation costs during standard supervised finetuning. Advanced finetuning methods with auxiliary training objectives can improve performance but incur significant computational overhead.

The proposed method decouples the auxiliary training objectives from standard supervised finetuning to enhance model capabilities while reducing computational overhead. The model is first trained to convergence on a small-scale task set with two distinct training strategies, which yields two finetuned models. The parameter difference between the two models is interpreted as a set of capability vectors contributed by the auxiliary objectives. These vectors are then merged with the pre-trained parameters to form a capability-enhanced meta model.

The method also uses a lightweight orthogonal regularization loss to augment standard supervised finetuning, which reduces computational overhead. The results show that the capability vectors are effective and versatile across diverse models, and can generalize to novel environments and embodiments without additional training. The proposed approach achieves performance comparable to auxiliary finetuned baselines with reduced computational overhead, making it a promising solution for improving vision-language-action models.
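The parameter-difference idea described in this summary can be illustrated with a minimal sketch. It uses plain Python dicts of floats in place of real checkpoints. The function names, the direction of the subtraction (auxiliary-finetuned minus plain-finetuned), and the scaling factor `alpha` are assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict

Params = Dict[str, float]  # toy stand-in for a model state dict

def capability_vector(theta_aux: Params, theta_sft: Params) -> Params:
    """Per-parameter difference between the two finetuned models (assumed: aux minus SFT)."""
    return {k: theta_aux[k] - theta_sft[k] for k in theta_sft}

def merge(theta_pre: Params, vec: Params, alpha: float = 1.0) -> Params:
    """Add the (optionally scaled) capability vector to the pre-trained weights.

    `alpha` is a hypothetical knob; the summary does not mention one.
    """
    return {k: theta_pre[k] + alpha * vec[k] for k in theta_pre}

# Toy values chosen so the arithmetic is exact in floating point.
theta_pre = {"w": 1.0, "b": 0.0}    # pre-trained weights
theta_sft = {"w": 1.5, "b": 0.25}   # finetuned with plain SFT
theta_aux = {"w": 1.75, "b": 0.5}   # finetuned with auxiliary objectives

vec = capability_vector(theta_aux, theta_sft)
meta = merge(theta_pre, vec)
print(vec, meta)
```

With real models the same arithmetic would run tensor by tensor over matching state-dict keys.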

📅 Published on May 11

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.10903
• PDF: https://arxiv.org/pdf/2605.10903
• Project Page: https://capvector.github.io/
• GitHub: https://github.com/OpenHelix-Team/CapVector ⭐ 26

🤖 Models citing this paper:
• https://huggingface.co/haofuly/capvector_models_collection

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #ParametricSpaceLearning #TransferableCapabilities #VisionLanguageAction #MultimodalLearning

283 views · 03:50

🔥 SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

💡 The paper introduces SenseNova-U1, a unified multimodal model that integrates understanding and generation into a single process, overcoming the traditional divide between these two tasks. Current large vision-language models treat understanding and generation as separate problems, leading to fragmented architectures and misaligned representation spaces. The authors argue that this divide hinders the emergence of native multimodal intelligence and propose a new paradigm, NEO-unify, which views understanding and generation as synergistic aspects of a single process.

The authors present two variants of SenseNova-U1, built on dense and mixture-of-experts understanding baselines, and demonstrate their performance across various tasks, including text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. The models also excel in image synthesis, infographic generation, and interleaved vision-language generation, showing strong semantic consistency and visual fidelity.

The paper provides detailed information on model design, data preprocessing, pre- and post-training, and inference strategies, supporting community research. The results show that SenseNova-U1 models perform strongly in vision-language-action and world model scenarios, indicating a broader roadmap where models can think and act across modalities in a native manner. The authors conclude that multimodal AI should focus on building a unified system, rather than connecting separate systems, allowing necessary capabilities to emerge from within. Overall, the paper contributes to the development of unified multimodal models that can integrate understanding and generation, paving the way for more advanced and native multimodal intelligence.

📅 Published on May 12

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.12500
• PDF: https://arxiv.org/pdf/2605.12500
• GitHub: https://github.com/OpenSenseNova/SenseNova-U1 ⭐ 1.6k

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalUnderstanding #NEOunifyArchitecture #VisionLanguageModels #MultimodalGeneration #UnifiedIntelligenceModels

517 views · 13:50

🔥 dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

💡 The paper introduces dots.ocr, a single Vision-Language Model that jointly learns layout detection, text recognition, and relational understanding in a unified, end-to-end framework. Current document layout parsing methods rely on fragmented, multi-stage pipelines that suffer from error propagation and cannot exploit the synergies of joint training. A highly scalable data engine synthesizes a vast multilingual corpus, which lets the model perform robustly across a wide range of tasks, languages, layouts, and domains. The model is validated on OmniDocBench and on XDocParse, a challenging new benchmark introduced in the paper that spans 126 languages. dots.ocr achieves state-of-the-art performance on document layout parsing and sets a strong new baseline, outperforming the next-best competitor by 7.4 points. The contributions are the unified model, the new benchmark for multilingual document intelligence, and evidence that learning the three tasks jointly in a single model is advantageous.

📅 Published on Dec 2, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2512.02498
• PDF: https://arxiv.org/pdf/2512.02498

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DocumentLayoutParsing #VisionLanguageModels #MultilingualOCR #RelationalUnderstanding #EndToEndLearning

❤1

651 views · 01:50

🔥 Unlocking Dense Metric Depth Estimation in VLMs

💡 The paper proposes DepthVLM, a framework that enhances Vision-Language Models with dense geometry prediction capabilities. Vision-Language Models are limited in 3D understanding due to their text-only supervision paradigm, which prevents the recovery of dense geometry. Prior methods have limitations such as error accumulation or inefficient prediction. DepthVLM addresses this by attaching a lightweight depth head to the model backbone and training it under a unified vision-text supervision paradigm with a two-stage schedule. This allows the model to generate full-resolution depth maps alongside language outputs in a single forward pass. The authors also introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. The results show that DepthVLM significantly outperforms existing Vision-Language Models, surpasses leading pure vision models, and improves complex 3D spatial reasoning, making it a step toward a truly unified foundation model. The code and checkpoints will be publicly released, making it accessible for further research and development. Overall, DepthVLM provides a simple yet effective solution for dense metric depth estimation in Vision-Language Models, unlocking their potential for 3D understanding and spatial reasoning.

📅 Published on May 15

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15876
• PDF: https://arxiv.org/pdf/2605.15876
• Project Page: https://depthvlm.github.io/

🤖 Models citing this paper:
• https://huggingface.co/JonnyYu828/DepthVLM-4B

📊 Datasets citing this paper:
• https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #DenseMetricDepthEstimation #DepthEstimationInVLMs #GeometryPrediction #VisionTextSupervision

❤2

598 views · 11:51

🔥 More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

💡 This paper examines reasoning in Vision Language Models and identifies a dual nature of multimodal reasoning. Reasoning enhances logical inference and improves performance on complex tasks, but it can also impair perceptual grounding, leading to recognition failures on basic visual questions. The authors attribute this to visual forgetting, where prolonged reasoning causes the model to disregard visual input. To address it, they propose Vision Anchored Policy Optimization, a method that steers the reasoning process toward visually grounded trajectories. The resulting model, VAPO Thinker 7B, relies significantly more on visual information and achieves state-of-the-art results on a range of benchmarks.

📅 Published on Sep 30, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2509.25848
• PDF: https://arxiv.org/pdf/2509.25848
• Project Page: https://xytian1008.github.io/VAPO/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #MultimodalReasoning #VisualForgetting #VisionAnchoredPolicyOptimization #PerceptualGrounding

718 views · 07:53

🔥 LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

💡 The paper introduces LLaVA-OneVision-2, a vision-language model that achieves superior performance across various multimodal benchmarks. The problem addressed is the need for a more capable model that can efficiently process and understand video content. The method used to achieve this is codec-stream tokenization, which treats compressed video as a continuous bit-cost stream and allocates a limited token budget to event-bearing content. This approach enables more stable long-video token compression than fixed groups of pictures. The model also incorporates windowed attention for efficient local computation and a shared 3D RoPE to place codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system.

The model was trained using large-scale open supervision, with approximately 8 million re-captioned video samples for pretraining and a 4 million sample spatial corpus for fine-tuning. The paper also introduces JumpScore, a temporal-localization benchmark that targets fine-grained grounding in high-frequency, densely repeated motion. The results show that LLaVA-OneVision-2 outperforms existing models, including Qwen3-VL-8B, by a significant margin. On the JumpScore benchmark, LLaVA-OneVision-2-8B reaches 74.9 JumpScore mAP, surpassing Qwen3-VL-8B by 44.8 points. The model also outperforms Qwen3-VL-8B by 4.3 average points on video tasks, 5.3 on spatial tasks, and 15.6 average J&F on tracking tasks.

The key contributions of the paper are the introduction of codec-stream tokenization, windowed attention, and large-scale open supervision, which enable the model to achieve superior performance across a broad range of multimodal benchmarks. The paper also highlights the importance of unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning. Overall, the paper demonstrates the effectiveness of LLaVA-OneVision-2 in achieving next-generation perceptual intelligence.
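The idea of spending a limited token budget where the compressed stream carries the most bits can be illustrated with a toy allocator. This is only a sketch under stated assumptions: the summary does not give the actual allocation rule, so the code below uses simple proportional allocation with largest-remainder rounding as a stand-in, and `allocate_tokens` and its inputs are hypothetical.

```python
from typing import List

def allocate_tokens(bit_costs: List[int], budget: int) -> List[int]:
    """Split `budget` tokens across segments in proportion to their bit cost.

    Stand-in rule (assumption): proportional shares, rounded with the
    largest-remainder method so the shares sum exactly to `budget`.
    """
    total = sum(bit_costs)
    if total == 0:
        return [0] * len(bit_costs)
    raw = [budget * c / total for c in bit_costs]
    alloc = [int(r) for r in raw]  # floor of each share
    leftover = budget - sum(alloc)
    # Hand leftover tokens to the segments with the largest fractional parts.
    order = sorted(range(len(raw)), key=lambda i: raw[i] - alloc[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc

# Hypothetical per-segment bit costs: static segments are cheap, eventful ones expensive.
alloc = allocate_tokens([100, 900, 0, 1000], budget=20)
print(alloc)
```

Segments with zero bit cost receive no tokens here, which mirrors the intuition that near-static content needs little representation.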

📅 Published on May 25

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.25979
• PDF: https://arxiv.org/pdf/2605.25979
• Project Page: https://evolvinglmms-lab.github.io/LLaVA-OneVision-2/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalLearning #VisionLanguageModels #VideoContentUnderstanding #PerceptualIntelligence #CodecStreamTokenization

546 views · 23:51

🔥 LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

💡 The paper addresses the issue of modality sensitivity in vision-language models, which occurs when a model's performance degrades significantly when the modality of the input is changed, such as replacing a textual question with its rendered-image counterpart. This problem arises due to the inherent bias in current training corpora, where text and images are typically organized into distinct and asymmetric roles. To address this issue, the authors propose Local Modality Substitution, a data curation approach that provides supervision for cross-modal representational invariance between semantically equivalent text and image carriers. This method reformulates single-modality prompts into seamlessly interleaved multimodal sequences by dynamically selecting target text spans and recasting them as rendered images, thereby preserving the same semantics across different carriers. The authors evaluate their approach on 13 diverse multimodal benchmarks and demonstrate that it significantly improves overall multimodal reasoning and yields deeper cross-modal fusion, achieving consistent gains across foundational models. Specifically, the approach delivers improvements of 2.67 points on one model and 2.82 points on another, compared to standard methods. The proposed method is lightweight and architecture-agnostic, making it a valuable contribution to the field of vision-language models.
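The span-substitution step described above can be sketched as follows. This is an illustrative assumption, not the authors' pipeline: `substitute_span` is a hypothetical helper, the span is chosen by hand rather than dynamically, and the "image" segment keeps the raw text as a placeholder where a real pipeline would rasterize it to pixels.

```python
from typing import List, Tuple

Segment = Tuple[str, str]  # (modality, payload)

def substitute_span(prompt: str, start: int, end: int) -> List[Segment]:
    """Split `prompt` into text / image / text segments around [start, end)."""
    if not (0 <= start < end <= len(prompt)):
        raise ValueError("span must lie inside the prompt")
    segments: List[Segment] = []
    if prompt[:start]:
        segments.append(("text", prompt[:start]))
    # Placeholder payload: a real pipeline would render this span as an image.
    segments.append(("image", prompt[start:end]))
    if prompt[end:]:
        segments.append(("text", prompt[end:]))
    return segments

prompt = "What is 12 * 7 minus 4?"
segments = substitute_span(prompt, 8, 14)
print(segments)
```

Concatenating the payloads reproduces the original prompt, which is the invariant the method relies on: the semantics stay the same and only the carrier changes.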

📅 Published on May 28

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30265
• PDF: https://arxiv.org/pdf/2605.30265
• Project Page: https://maplebb.github.io/LoMo/page/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #ModalitySubstitution #CrossModalLearning #MultimodalFusion #DeepLearningArchitectures

589 views · 15:52

🔥 VLM3: Vision Language Models Are Native 3D Learners

💡 The paper "VLM3: Vision Language Models Are Native 3D Learners" challenges the common approach to 3D understanding tasks in computer vision. Typically, these tasks rely on specialized vision models with complex designs and extensive data augmentation. The authors argue instead that vision language models can be adapted for 3D understanding tasks through simple architectural modifications and text-based training.

The problem addressed in this paper is that 3D understanding tasks such as depth estimation and object-level 3D understanding are currently dominated by expert vision models that have complex task-specific designs. The authors propose that vision language models can be native 3D learners and achieve comparable performance to these specialized models.

The method used in this study involves making three simple modifications to standard vision language models. These modifications include focal length unification, text-based pixel reference, and data mixture and scaling. The authors propose VLM3, a scalable method that enables standard vision language models to master diverse 3D tasks without requiring complex designs or extensive data augmentation.

The results of the study show that VLM3 advances the depth estimation accuracy of vision language models by a large margin, from 0.84 to 0.9. Additionally, VLM3 enables diverse 3D tasks such as pixel correspondence, camera pose estimation, and object-level 3D understanding, matching the accuracy of expert vision models while maintaining standard architectures and text-based training. Overall, the paper presents a new paradigm for simple and scalable 3D learning, demonstrating that vision language models can be effective native 3D learners.
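The summary names focal length unification without describing it. One common way to normalize focal length, shown here purely as an assumed illustration, is to resize each image so that its focal length in pixels matches a shared canonical value. The function name and the canonical value of 1000 pixels are hypothetical and not taken from the paper.

```python
from typing import Tuple

def unify_focal_length(
    width: int,
    height: int,
    focal_px: float,
    target_focal_px: float = 1000.0,  # hypothetical canonical focal length
) -> Tuple[int, int, float]:
    """Return the resized (width, height) and the scale that maps the
    image's focal length in pixels onto `target_focal_px`.

    Resizing an image by a factor s also scales its pixel focal length by s,
    so s = target / current brings every image to the same focal length.
    """
    if focal_px <= 0:
        raise ValueError("focal length must be positive")
    scale = target_focal_px / focal_px
    return round(width * scale), round(height * scale), scale

new_w, new_h, scale = unify_focal_length(1920, 1080, focal_px=2000.0)
print(new_w, new_h, scale)
```

After this step, metric depth no longer depends on per-image intrinsics in this toy setting, so a model can be queried in a single consistent camera convention.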

📅 Published on May 28

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30561
• PDF: https://arxiv.org/pdf/2605.30561

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #3DUnderstanding #DepthEstimation #ObjectLevel3D #ComputerVisionModels

433 views · 13:50

🔥 SpatialClaw: Rethinking Action Interface for Agentic Spatial Reasoning

💡 The paper introduces SpatialClaw, a training-free framework that enables flexible and stateful spatial reasoning in vision-language models. The problem addressed is the limitation of current spatial agents in performing open-ended spatial reasoning tasks, which is due to the design of the action interface that invokes specialist perception modules. Existing spatial agents use either single-pass code execution or a structured tool-call interface, both of which offer limited flexibility for complex 3D/4D spatial reasoning.

The proposed SpatialClaw framework uses code as the action interface, allowing a vision-language model-backed agent to write executable code conditioned on prior outputs. This approach enables the agent to flexibly compose and manipulate perception results and adapt its analysis to intermediate text and visual observations. SpatialClaw maintains a stateful Python kernel pre-loaded with input frames and a suite of perception and geometry primitives.

The results show that SpatialClaw achieves superior performance across diverse 3D/4D spatial reasoning tasks, with an average accuracy of 59.9% across 20 benchmarks. This represents a significant improvement of 11.2 points over the recent spatial agent, with consistent gains across six vision-language model backbones from two model families, without any benchmark- or model-specific adaptation. The paper's contribution is the introduction of a flexible and effective framework for spatial reasoning that can be applied to a wide range of tasks without requiring training or adaptation.
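The stateful-kernel idea, where each code action can build on variables left behind by earlier ones, can be sketched in a few lines. This is a minimal illustration under assumptions: `StatefulKernel` is a hypothetical class, the preloaded `frames` are placeholder strings, and `exec` is used without sandboxing, which would be unsafe for untrusted model output.

```python
import contextlib
import io
from typing import Any, Dict, Optional

class StatefulKernel:
    """Toy persistent Python namespace that executes code strings."""

    def __init__(self, preload: Optional[Dict[str, Any]] = None) -> None:
        # Variables preloaded here stay visible to every later action.
        self.namespace: Dict[str, Any] = dict(preload or {})

    def run(self, code: str) -> str:
        """Execute `code` in the shared namespace and return captured stdout."""
        buffer = io.StringIO()
        with contextlib.redirect_stdout(buffer):
            exec(code, self.namespace)  # no sandboxing: illustration only
        return buffer.getvalue()

kernel = StatefulKernel(preload={"frames": ["frame0", "frame1", "frame2"]})
out1 = kernel.run("n = len(frames)\nprint(n)")  # first action defines n
out2 = kernel.run("print(n * 2)")               # second action reuses n
print(out1, out2)
```

In an agent loop, the text returned by `run` would be fed back to the model as the observation for its next code action.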

📅 Published on Jun 11

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.13673
• PDF: https://arxiv.org/pdf/2606.13673
• Project Page: https://spatialclaw.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#SpatialReasoning #VisionLanguageModels #AgenticInterfaces #SpatialArtificialIntelligence #CodeBasedActionInterfaces

447 views · 19:53

🔥 JoyAI-VL-Interaction: Real-Time Vision-Language Interaction Intelligence

💡 The paper introduces a new paradigm for vision-language models, shifting from turn-based systems that require user prompting to a model that operates in real-time, making autonomous decisions about when to respond or delegate. The problem with current large models is that they only answer when addressed and do not interact in real-time, even in video-call apps. To address this, the authors propose a model that continuously watches what is happening and decides on its own whether to speak or stay silent.

The authors make two main contributions. First, they release JoyAI-VL-Interaction, an 8B-scale vision-first vision-language interaction model that makes the response decision internally, choosing each second to stay silent, respond, or delegate to a background model. The model excels at vision-triggered responsiveness and time awareness. They also provide a transferable training recipe that allows for capabilities to emerge that were not explicitly trained for, such as guiding a shopper through changing app screens or improvising a lecture from a slide deck.

Second, they release a complete deployable system built around the model, which streams any ongoing video into the model, making it genuinely present in the world. The system has pluggable components, including ASR/TTS modules, memory, visualization UI, and a background brain that can connect to any API or agent.

The results show that human raters prefer JoyAI-VL-Interaction over in-app video-call assistants by a wide margin across six real-world scenarios. This is the first open, vision-driven interaction model released together with its training recipe, data, and complete deployable system, making it a significant contribution to the field of interaction models.
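The per-second choice between staying silent, responding, and delegating can be pictured as a small decision loop. The sketch below is an assumption-laden toy: `toy_policy` is a hypothetical keyword rule standing in for the model's internal decision, and the observations are plain strings rather than video frames.

```python
from enum import Enum
from typing import List, Tuple

class Action(Enum):
    SILENT = "silent"
    RESPOND = "respond"
    DELEGATE = "delegate"

def toy_policy(observation: str) -> Action:
    """Hypothetical rule-based stand-in for the model's internal decision."""
    if "question" in observation:
        return Action.RESPOND
    if "complex" in observation:
        return Action.DELEGATE
    return Action.SILENT

def run_stream(observations: List[str]) -> List[Tuple[int, Action]]:
    """Make one decision per time step (one per second in the paper's setting)."""
    return [(t, toy_policy(obs)) for t, obs in enumerate(observations)]

timeline = run_stream(["idle", "user question", "complex task", "idle"])
print(timeline)
```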

📅 Published on Jun 10

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.14777
• PDF: https://arxiv.org/pdf/2606.14777
• Project Page: https://joyai-vl-video-future-academy-jd.github.io/JoyAI-VL-Interaction/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #RealTimeInteraction #AutonomousDecisionMaking #VisionFirstApproach #MultimodalIntelligence

🔥1

612 views · 18:20

🔥 olmOCR: Unlocking Trillions of Tokens in PDFs with Vision Language Models

💡 The paper presents olmOCR, an open-source toolkit that uses a fine-tuned vision language model to extract clean text from PDF documents while preserving their structure. PDFs come in diverse formats and visual layouts, which makes it hard to extract and represent their content for language model use. The toolkit uses a 7-billion-parameter vision language model trained on a sample of 260,000 pages from over 100,000 crawled PDFs with diverse properties. The model is fine-tuned to convert PDFs into clean, linearized plain text in natural reading order, preserving structured content such as sections, tables, lists, and equations. olmOCR is optimized for large-scale batch processing, scales flexibly to different hardware setups, and can convert a million PDF pages for 190 USD. The release is open source and includes the vision language model weights, data, training code, and inference code, so that the trillions of tokens in PDF documents can be used to train high-quality language models.
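The quoted figure of 190 USD per million pages translates into a simple back-of-the-envelope estimate. The sketch below assumes cost scales linearly with page count, which the summary does not state; `estimate_cost_usd` is a hypothetical helper.

```python
COST_PER_MILLION_PAGES_USD = 190.0  # figure quoted in the summary

def estimate_cost_usd(num_pages: int) -> float:
    """Estimate conversion cost, assuming it scales linearly with page count."""
    return num_pages * COST_PER_MILLION_PAGES_USD / 1_000_000

cost_million = estimate_cost_usd(1_000_000)
cost_small = estimate_cost_usd(50_000)
print(cost_million, cost_small)
```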

📅 Published on Feb 25, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2502.18443
• PDF: https://arxiv.org/pdf/2502.18443
• Project Page: https://olmocr.allenai.org/

📊 Datasets citing this paper:
• https://huggingface.co/datasets/allenai/olmOCR-bench
• https://huggingface.co/datasets/shhdwi/olmocr-pre-rendered
• https://huggingface.co/datasets/Voxel51/olmOCR_bench

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/davanstrien/benchmark-race
• https://huggingface.co/spaces/OpenEvals/every-leaderboards
• https://huggingface.co/spaces/OpenEvals/leaderboard-watcher

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #PDFTextExtraction #DocumentLayoutAnalysis #OCRTechniques #NaturalLanguageProcessing

1.05K views 11:51

🔥 Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

💡 The paper introduces a unified framework called Perceive-to-Reason that improves fine-grained visual reasoning performance on high-resolution images. Fine-grained visual reasoning is a challenging task for vision-language models, especially when small but critical visual cues are buried in high-resolution images. Existing approaches typically do not explicitly distinguish between perception and reasoning, instead relying on repeated cropping or test-time visual search to introduce local evidence.

The Perceive-to-Reason framework addresses this limitation by formulating fine-grained visual reasoning as a two-stage process. In the first stage, the model localizes question-relevant evidence as a Perceiver, and in the second stage, it answers the question as a Reasoner based on the annotated image and cropped regions. To train the model, the authors introduce a role-aware reinforcement learning strategy called Perception-Reasoning Alternating GRPO, which alternates between perception-focused and reasoning-focused updates using only final-answer supervision.
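
The training signal described above can be sketched roughly as follows. GRPO-style methods score each sampled rollout relative to the other rollouts for the same question, so a final-answer reward alone is enough to produce a learning signal. The alternation schedule (`role_for_step` and its `block` parameter) is a hypothetical stand-in; the summary does not say how often the roles switch.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: normalize each reward against its sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_for_step(step, block=1):
    """Hypothetical schedule: switch the updated role every `block` steps."""
    return "perceiver" if (step // block) % 2 == 0 else "reasoner"


if __name__ == "__main__":
    # Final-answer rewards (1 = correct) for four rollouts of one question.
    rewards = [1.0, 0.0, 0.0, 1.0]
    advantages = group_relative_advantages(rewards)
    for step in range(4):
        # Only the role selected for this step would receive the gradient update.
        print(step, role_for_step(step), [round(a, 2) for a in advantages])
```

Rollouts with a correct final answer get a positive advantage and the rest a negative one, regardless of which role is being updated on that step.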

The Perceive-to-Reason framework is built on top of existing vision-language models, and it consistently improves performance across model scales. The results show that the Perceive-to-Reason framework achieves state-of-the-art performance on several benchmarks, including V-Star, HR-Bench-4K, and HR-Bench-8K. Specifically, the P2R-4B model achieves 93.2 percent on V-Star, 81.9 percent on HR-Bench-4K, and 80.5 percent on HR-Bench-8K, substantially outperforming its corresponding backbone.

The benefits extend beyond high-resolution benchmarks to broader multimodal reasoning tasks, which suggests that explicitly decoupling perception from reasoning is an effective approach to fine-grained visual reasoning.

📅 Published on Jul 1

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2607.01191
• PDF: https://arxiv.org/pdf/2607.01191

🤖 Models citing this paper:
• https://huggingface.co/hongxingli/P2R-4B
• https://huggingface.co/hongxingli/P2R-2B
• https://huggingface.co/hongxingli/P2R-8B

📊 Datasets citing this paper:
• https://huggingface.co/datasets/hongxingli/P2R-10k

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#FineGrainedVisualReasoning #VisualReasoningModels #PerceptionAndReasoning #HighResolutionImageAnalysis #VisionLanguageModels

836 views 15:52

🔥 MetaSpatial: Reinforcing 3D Spatial Reasoning in VLMs for the Metaverse

💡 MetaSpatial is a framework that uses reinforcement learning to improve 3D spatial reasoning in vision-language models, which are used to generate 3D scenes. The problem with current models is that they lack internalized 3D spatial reasoning, which limits their ability to generate realistic layouts. Additionally, traditional supervised fine-tuning methods are not effective for layout generation tasks because perfect ground truth annotations are not available.

To address these challenges, MetaSpatial introduces a multi-turn reinforcement learning-based optimization mechanism that integrates physics-aware constraints and rendered image evaluations. Over multiple turns, the model analyzes the rendered output of its current layout and refines the spatial arrangement, so that the generated 3D layouts become progressively more coherent, physically plausible, and aesthetically consistent.
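
To make "physics-aware constraints" concrete, here is a minimal sketch of one such check: counting object collisions and out-of-room placements in a layout. It is a simplified 2D stand-in, not the reward used by MetaSpatial; `Box`, `physics_penalty`, and the room dimensions are hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Box:
    """Axis-aligned footprint of an object on the floor plane."""
    x: float
    y: float
    w: float
    d: float


def overlaps(a: Box, b: Box) -> bool:
    """True if the two footprints intersect (touching edges do not count)."""
    return (a.x < b.x + b.w and b.x < a.x + a.w
            and a.y < b.y + b.d and b.y < a.y + a.d)


def inside_room(b: Box, room_w: float, room_d: float) -> bool:
    """True if the footprint lies fully within the room rectangle."""
    return b.x >= 0 and b.y >= 0 and b.x + b.w <= room_w and b.y + b.d <= room_d


def physics_penalty(layout, room_w: float, room_d: float) -> int:
    """Count pairwise collisions plus objects that leave the room."""
    collisions = sum(overlaps(a, b) for a, b in combinations(layout, 2))
    out_of_bounds = sum(not inside_room(b, room_w, room_d) for b in layout)
    return collisions + out_of_bounds


if __name__ == "__main__":
    turn_1 = [Box(0, 0, 2, 2), Box(1, 1, 2, 2)]  # the two objects collide
    turn_2 = [Box(0, 0, 2, 2), Box(2, 2, 2, 2)]  # refined layout, no collision
    print(physics_penalty(turn_1, 4, 4), physics_penalty(turn_2, 4, 4))
```

In a multi-turn setup, a penalty like this could be recomputed after each refinement and combined with a score derived from the rendered image.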

Empirical evaluations show that MetaSpatial significantly improves the spatial consistency and formatting stability of models at various scales. After training, object placements are more realistic, aligned, and functionally coherent, which validates the effectiveness of reinforcement learning for 3D spatial reasoning in applications such as the metaverse, AR/VR, digital twins, and game development.

Overall, the contributions of MetaSpatial are the introduction of a reinforcement learning-based framework that enhances 3D spatial reasoning in vision-language models, and the demonstration of its effectiveness in generating realistic and coherent 3D scenes. The code, data, and training pipeline are publicly available, which can facilitate further research and development in this area.

📅 Published on Mar 24, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2503.18470
• PDF: https://arxiv.org/pdf/2503.18470
• Project Page: https://github.com/PzySeere/MetaSpatial

📊 Datasets citing this paper:
• https://huggingface.co/datasets/johnschaefer/EasyR1-qwen3vl-rl
• https://huggingface.co/datasets/Yuting6/ttrl
• https://huggingface.co/datasets/zhenyupan/3d_layout_reasoning

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #ReinforcementLearningFor3D #MetaverseArchitecture #3DSpatialReasoning #PhysicsAwareAI

971 views 13:53

🔥 Xiaomi-Robotics-1: Scaling Vision-Language-Action Models with over 100K Hours of Real-World Trajectories

💡 The paper presents Xiaomi-Robotics-1, a foundational vision-language-action model that can follow diverse language instructions to perform a wide range of mobile manipulation tasks in unseen environments. The model is trained using a two-stage training recipe consisting of pre-training and post-training. During pre-training, the model is trained on over 100,000 hours of real-world manipulation trajectories collected via UM devices, and a scalable auto-labeling pipeline is developed to annotate trajectory clips with natural language descriptions of scene state transitions. This provides rich and precise conditioning for action learning.

During post-training, the model is fine-tuned to align with robot embodiment and imperative instructions that humans naturally use to prompt robots. The experiments demonstrate strong scaling behavior, with the model consistently improving with increased data scales and model sizes during pre-training. This scaling behavior directly transfers to post-training, where stronger pre-training models yield better out-of-the-box real-robot performance in unseen environments.

The results show that Xiaomi-Robotics-1 serves as a strong robot foundation policy that can be fine-tuned on complex, dexterous tasks with high data efficiency. Across multiple simulation benchmarks it sets a new state of the art: a 57.6% success rate on RoboCasa 365, up from the previous best of 46.6%, and an average score of 20.07 on RoboDojo, significantly above the prior state of the art. The code and model checkpoints will be released.
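
For readers comparing the two RoboCasa 365 figures, the gap can be stated either in percentage points or as a relative gain. The helper below is only an illustrative calculation on the numbers quoted above; the function name is hypothetical.

```python
from typing import Tuple


def improvement(new_rate: float, old_rate: float) -> Tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    absolute = new_rate - old_rate
    relative = absolute / old_rate * 100
    return absolute, relative


if __name__ == "__main__":
    # Success rates quoted in the summary for RoboCasa 365.
    points, percent = improvement(57.6, 46.6)
    print(f"+{points:.1f} points, +{percent:.1f}% relative")
```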

Overall, the paper contributes to the development of a robust and scalable vision-language-action model that can be applied to a wide range of robotic tasks, and demonstrates the effectiveness of the proposed two-stage training recipe and auto-labeling pipeline in achieving state-of-the-art performance.

📅 Published on Jul 16

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2607.15330
• PDF: https://arxiv.org/pdf/2607.15330
• Project Page: https://robotics.xiaomi.com/xiaomi-robotics-1.html

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #MobileManipulation #RoboticsLearning #RealWorldTrajectories #LanguageActionInterfaces

478 views 23:55

🔥 Mage-VL: An Efficient Codec-Native Streaming Multimodal Foundation Model

💡 The paper presents Mage-VL, an efficient codec-native streaming multimodal foundation model for real-time multimodal understanding and interaction. Standard vision-language models suffer from Moravec's paradox: they excel at complex offline visual reasoning but struggle with simple streaming perception tasks and process them inefficiently. To address this, the authors propose a custom tokenizer, Mage-ViT, which replaces uniform frame sampling with selective encoding of dynamic, entropy-rich regions using motion vectors and residual energy across sparse anchor and predicted frames. This approach reduces visual token consumption by over 75 percent while preserving spatiotemporal context.
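
The idea of encoding only dynamic regions can be illustrated with a toy patch selector. This is a sketch under stated assumptions, not Mage-ViT itself: the scoring rule (motion-vector magnitude plus weighted residual energy), the `alpha` weight, and the `keep_ratio` of 0.25 are all hypothetical choices made to mirror the roughly 75 percent token reduction mentioned above.

```python
import math


def select_patches(motion_vectors, residual_energy, keep_ratio=0.25, alpha=1.0):
    """Return sorted indices of the patches to encode.

    Patches are ranked by a hypothetical saliency score:
    |motion vector| + alpha * residual energy.
    """
    scores = [
        math.hypot(dx, dy) + alpha * energy
        for (dx, dy), energy in zip(motion_vectors, residual_energy)
    ]
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Eight patches of one predicted frame; most of them are static.
    mvs = [(0, 0), (3, 4), (0, 0), (0, 1), (0, 0), (0, 0), (1, 0), (0, 0)]
    res = [0.0, 0.5, 0.1, 0.0, 2.0, 0.0, 0.0, 0.0]
    kept = select_patches(mvs, res)
    print(kept, f"token reduction: {1 - len(kept) / len(mvs):.0%}")
```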

Mage-ViT is trained from scratch on approximately 560 million unlabeled images and 100 million unlabeled video frames, and it matches or outperforms flagship encoders trained on billions of image-text pairs. The authors also establish AI4AI data pipelines encompassing prompt-code joint optimization for multimodal captioning and AI-driven performance diagnosis to guide training recipes.

Furthermore, through a bio-inspired dual-system architecture, consisting of a lightweight System 1 event gate and a causal System 2 decoder, Mage-VL enables proactive streaming perception. Extensive evaluations show that Mage-VL-4B matches Qwen3-VL-4B on static tasks while achieving strong gains in video understanding and 2D/3D spatial reasoning, with up to a 3.5 times wall-clock inference speedup, and comprehensively surpasses the 15B Phi-4-reasoning-vision baseline.
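
The dual-system split can be pictured as a cheap gate in front of an expensive decoder: the gate looks at every frame, and the decoder runs only when the gate fires. The sketch below assumes a scalar per-frame change score and a fixed threshold; both, along with the function names, are hypothetical simplifications rather than details from the paper.

```python
from typing import Callable, Iterable, List, Tuple


def event_gate(change_score: float, threshold: float) -> bool:
    """System 1 stand-in: cheap test for whether a frame carries a new event."""
    return change_score >= threshold


def run_stream(
    change_scores: Iterable[float],
    decoder: Callable[[int], str],
    threshold: float = 0.5,
) -> List[Tuple[int, str]]:
    """Invoke the (expensive) System 2 decoder only on gated frames."""
    outputs = []
    for t, score in enumerate(change_scores):
        if event_gate(score, threshold):
            outputs.append((t, decoder(t)))
    return outputs


if __name__ == "__main__":
    scores = [0.1, 0.9, 0.2, 0.7]
    print(run_stream(scores, lambda t: f"describe frame {t}"))
```

With this design the decoder cost scales with the number of events rather than the number of frames, which is one plausible source of wall-clock savings in streaming settings.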

The paper delivers seven key empirical findings covering pre-training data efficiency, variable-resolution scaling, codec system acceleration, Video QA SF redundancy, motion-spatial synergy, AI4AI data pipelines, and Zero-Vision SFT for multimodal RL. Overall, the paper presents a novel approach to multimodal understanding and interaction, with significant improvements in efficiency, accuracy, and speed.

📅 Published on Jul 27

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2607.24904
• PDF: https://arxiv.org/pdf/2607.24904
• Project Page: https://microsoft.github.io/Mage

🤖 Models citing this paper:
• https://huggingface.co/microsoft/Mage-VL

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/microsoft/mage-vl-demo

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalFoundationModels #CodecNativeStreaming #VisionLanguageModels #RealTimeMultimodalUnderstanding #EfficientMultimodalProcessing

295 views 19:56
