AI & ML Papers

🔥 LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

💡 The paper introduces LLaVA-OneVision-2, a vision-language model built to process and understand video content more efficiently and more capably than earlier models. Its core method is codec-stream tokenization, which treats compressed video as a continuous bit-cost stream and spends a limited token budget on event-bearing content. According to the paper, this gives more stable long-video token compression than fixed groups of pictures. The model also uses windowed attention for efficient local computation, and a shared 3D RoPE (rotary position embedding) that places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system.
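The budget idea behind codec-stream tokenization can be illustrated with a small Python sketch. This is not the paper's algorithm: the function name `allocate_token_budget`, the per-segment bit costs, and the proportional split are all assumptions made for illustration. It only shows how a fixed token budget might be steered toward segments that cost more bits to encode, which is used here as a rough proxy for event-bearing content.

```python
def allocate_token_budget(bit_costs, total_tokens):
    """Split total_tokens across segments in proportion to their bit cost.

    Hypothetical helper, not taken from the paper. bit_costs is a list of
    non-negative integers (bits spent by the codec on each segment).
    """
    if total_tokens < 0 or any(b < 0 for b in bit_costs):
        raise ValueError("bit costs and token budget must be non-negative")
    n = len(bit_costs)
    if n == 0:
        return []

    total_bits = sum(bit_costs)
    if total_bits == 0:
        # No segment carries any signal: fall back to an even split.
        weights, total_bits = [1] * n, n
    else:
        weights = list(bit_costs)

    # Integer floor of each proportional share, plus the exact remainder.
    alloc = [total_tokens * w // total_bits for w in weights]
    remainders = [total_tokens * w % total_bits for w in weights]

    # Largest-remainder rounding so the allocation sums to the budget exactly.
    leftover = total_tokens - sum(alloc)
    order = sorted(range(n), key=lambda i: remainders[i], reverse=True)
    for i in order[:leftover]:
        alloc[i] += 1
    return alloc


# Two quiet segments and one busy segment sharing a 10-token budget.
print(allocate_token_budget([100, 100, 800], 10))
```

In this toy setup the busy segment receives most of the budget while the quiet segments keep a small share. A uniform split would instead give every segment the same number of tokens regardless of content.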

The model was trained with large-scale open supervision: approximately 8 million re-captioned video samples for pretraining and a 4-million-sample spatial corpus for fine-tuning. The paper also introduces JumpScore, a temporal-localization benchmark that targets fine-grained grounding in high-frequency, densely repeated motion. The reported results place LLaVA-OneVision-2 ahead of existing models, including Qwen3-VL-8B. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 mAP, surpassing Qwen3-VL-8B by 44.8 points. It also outperforms Qwen3-VL-8B by 4.3 average points on video tasks, 5.3 on spatial tasks, and 15.6 average J&F on tracking tasks.

The key contributions of the paper are codec-stream tokenization, windowed attention, and large-scale open supervision, which together support the model's results across a broad range of multimodal benchmarks. The paper also stresses unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning, which it frames as a step towards next-generation perceptual intelligence.
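To make the efficiency argument for windowed attention concrete, here is a minimal Python sketch of a local attention mask. The summary does not say how the paper defines its windows, so the non-overlapping block layout, the function name `window_attention_mask`, and the sizes used are assumptions chosen only to show why local windows cut the number of attended token pairs.

```python
def window_attention_mask(num_tokens, window_size):
    """Return a boolean mask where mask[i][j] is True when tokens i and j
    fall in the same non-overlapping window.

    Hypothetical illustration; the paper's actual window layout may differ.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    return [
        [(i // window_size) == (j // window_size) for j in range(num_tokens)]
        for i in range(num_tokens)
    ]


def count_attended_pairs(mask):
    """Count the (query, key) pairs that the mask allows."""
    return sum(sum(row) for row in mask)


mask = window_attention_mask(num_tokens=6, window_size=2)
local_pairs = count_attended_pairs(mask)   # pairs allowed inside windows
full_pairs = 6 * 6                         # pairs under full attention
print(local_pairs, full_pairs)
```

With a fixed window size, the number of allowed pairs grows linearly with sequence length instead of quadratically, which is the usual motivation for local attention on long video token sequences.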

📅 Published on May 25

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.25979
• PDF: https://arxiv.org/pdf/2605.25979
• Project Page: https://evolvinglmms-lab.github.io/LLaVA-OneVision-2/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalLearning #VisionLanguageModels #VideoContentUnderstanding #PerceptualIntelligence #CodecStreamTokenization
