AI & ML Papers

🔥 Perceive-to-Reason: Decoupling Perception and Reasoning for Fine-Grained Visual Reasoning

💡 The paper introduces Perceive-to-Reason (P2R), a unified framework that improves fine-grained visual reasoning on high-resolution images. The task is hard for vision-language models, especially when small but critical visual cues are buried in a large image. Existing approaches typically do not separate perception from reasoning; they rely on repeated cropping or test-time visual search to bring in local evidence.

The Perceive-to-Reason framework addresses this limitation by casting fine-grained visual reasoning as a two-stage process. First, acting as a Perceiver, the model localizes question-relevant evidence. Second, acting as a Reasoner, it answers the question from the annotated image and the cropped regions. For training, the authors introduce a role-aware reinforcement learning strategy, Perception-Reasoning Alternating GRPO, which alternates between perception-focused and reasoning-focused updates using only final-answer supervision.
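
The two-stage flow and the alternating updates can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `Evidence`, `perceive_then_reason`, `group_relative_advantages`, and `role_for_step` are hypothetical, the box format is assumed, and the strict even/odd alternation is an assumption because the post does not describe the actual schedule. The advantage function follows the usual GRPO idea of normalizing each reward within its sampled group, using the population standard deviation here as a further assumption.

```python
from dataclasses import dataclass
from statistics import mean, pstdev
from typing import Callable, List, Tuple

# Hypothetical box format: (left, top, right, bottom) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class Evidence:
    """Regions the Perceiver marked as relevant to the question."""
    boxes: List[Box]


def perceive_then_reason(
    image: object,
    question: str,
    perceiver: Callable[[object, str], Evidence],
    reasoner: Callable[[object, str, Evidence], str],
) -> str:
    """Stage 1 localizes evidence; stage 2 answers from the image plus evidence."""
    evidence = perceiver(image, question)
    return reasoner(image, question, evidence)


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: center and scale rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def role_for_step(step: int) -> str:
    """Assumed schedule: even steps update perception, odd steps update reasoning."""
    return "perceiver" if step % 2 == 0 else "reasoner"


if __name__ == "__main__":
    # Stub models stand in for the real vision-language model in both roles.
    answer = perceive_then_reason(
        image=None,
        question="What color is the small sign?",
        perceiver=lambda img, q: Evidence(boxes=[(10, 20, 60, 80)]),
        reasoner=lambda img, q, ev: f"answered using {len(ev.boxes)} region(s)",
    )
    print(answer)

    # Final-answer rewards (1 = correct, 0 = wrong) for one group of rollouts.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    print([role_for_step(s) for s in range(4)])
```

In this sketch the only reward signal is whether the final answer was correct, which mirrors the post's statement that training uses final-answer supervision alone; how that reward is attributed to each role is not covered here.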

The Perceive-to-Reason framework is built on existing vision-language models and consistently improves performance across model scales. It achieves state-of-the-art results on several benchmarks, including V-Star, HR-Bench-4K, and HR-Bench-8K. Specifically, the P2R-4B model reaches 93.2% on V-Star, 81.9% on HR-Bench-4K, and 80.5% on HR-Bench-8K, substantially outperforming its corresponding backbone.

The benefits extend beyond high-resolution benchmarks to broader multimodal reasoning tasks, which suggests that explicitly decoupling perception from reasoning is an effective approach to fine-grained visual reasoning.

📅 Published on Jul 1

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2607.01191
• PDF: https://arxiv.org/pdf/2607.01191

🤖 Models citing this paper:
• https://huggingface.co/hongxingli/P2R-4B
• https://huggingface.co/hongxingli/P2R-2B
• https://huggingface.co/hongxingli/P2R-8B

📊 Datasets citing this paper:
• https://huggingface.co/datasets/hongxingli/P2R-10k

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#FineGrainedVisualReasoning #VisualReasoningModels #PerceptionAndReasoning #HighResolutionImageAnalysis #VisionLanguageModels
