AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
🔥 LLaVA-OneVision-2: Towards Next-Generation Perceptual Intelligence

💡 The paper introduces LLaVA-OneVision-2, a vision-language model that reports strong results across multimodal benchmarks. It targets the need for a model that can process and understand video content more efficiently. The core method is codec-stream tokenization, which treats compressed video as a continuous bit-cost stream and allocates a limited token budget to event-bearing content. According to the paper, this gives more stable long-video token compression than fixed groups of pictures. The model also uses windowed attention for efficient local computation and a shared 3D RoPE that places codec canvases, sampled frames, and images in a unified spatiotemporal coordinate system.
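
To make the budget-allocation idea concrete, here is a minimal sketch, not the paper's implementation. It assumes each video segment comes with a compressed size in bits and splits a fixed token budget in proportion to that bit cost, so segments where more changes on screen get more tokens. The function name allocate_token_budget, the per-segment bit costs, and the largest-remainder rounding are all illustrative assumptions.

```python
def allocate_token_budget(bit_costs, total_tokens):
    """Split total_tokens across segments in proportion to their bit cost.

    Hypothetical helper: bit_costs holds one non-negative integer per
    segment (its compressed size in bits). This is an illustrative
    proportional rule, not the tokenizer described in the paper.
    """
    n = len(bit_costs)
    if n == 0:
        return []
    total_bits = sum(bit_costs)
    if total_bits == 0:
        # No bit-cost signal at all: fall back to a uniform split.
        bit_costs = [1] * n
        total_bits = n

    # Integer quotient and remainder of each segment's proportional share.
    shares = [divmod(total_tokens * b, total_bits) for b in bit_costs]
    alloc = [q for q, _ in shares]

    # Largest-remainder rounding: hand leftover tokens to the segments
    # whose shares were truncated the most, so the budget is used exactly.
    leftover = total_tokens - sum(alloc)
    by_remainder = sorted(range(n), key=lambda i: shares[i][1], reverse=True)
    for i in by_remainder[:leftover]:
        alloc[i] += 1
    return alloc


if __name__ == "__main__":
    # Segments 2 and 4 carry most of the bits, so they get most of the tokens.
    print(allocate_token_budget([100, 100, 900, 100, 800], 20))
```

With a fixed group-of-pictures scheme, every segment would receive the same number of tokens regardless of content. In this sketch, the near-static segments receive one token each while the two high-bit-cost segments share the rest.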

The model was trained on large-scale open supervision: approximately 8 million re-captioned video samples for pretraining and a 4-million-sample spatial corpus for fine-tuning. The paper also introduces JumpScore, a temporal-localization benchmark that targets fine-grained grounding in high-frequency, densely repeated motion. On JumpScore, LLaVA-OneVision-2-8B reaches 74.9 mAP, surpassing Qwen3-VL-8B by 44.8 points. It also outperforms Qwen3-VL-8B by 4.3 average points on video tasks, 5.3 on spatial tasks, and 15.6 average J&F on tracking tasks.

The key contributions are codec-stream tokenization, windowed attention, and large-scale open supervision. The paper also emphasizes unified perception across video understanding, temporal grounding, spatial grounding, and manipulation-trace reasoning.


📅 Published on May 25

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.25979
• PDF: https://arxiv.org/pdf/2605.25979
• Project Page: https://evolvinglmms-lab.github.io/LLaVA-OneVision-2/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalLearning #VisionLanguageModels #VideoContentUnderstanding #PerceptualIntelligence #CodecStreamTokenization