✨Puzzle Curriculum GRPO for Vision-Centric Reasoning
📝 Summary:
Puzzle Curriculum GRPO (PC-GRPO) improves VLM visual reasoning without annotations. It uses self-supervised puzzle environments for verifiable rewards and a difficulty-aware curriculum to enhance consistency and accuracy.
🔹 Publication Date: Published on Dec 16
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.14944
• PDF: https://arxiv.org/pdf/2512.14944
• Project Page: https://pcgrpo.github.io/
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#VLM #VisualReasoning #SelfSupervisedLearning #ComputerVision #AI
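The PC-GRPO summary above mentions verifiable rewards from puzzle environments. Below is a minimal sketch, assuming a jigsaw-style puzzle whose reward is the fraction of correctly placed tiles, combined with the usual GRPO group-normalized advantage. The function names and the partial-credit reward are illustrative assumptions, not the paper's implementation, and the difficulty-aware curriculum is not modeled here.

```python
import statistics

def jigsaw_reward(predicted, target):
    """Hypothetical verifiable reward: fraction of tiles in the right slot."""
    if len(predicted) != len(target):
        return 0.0
    correct = sum(p == t for p, t in zip(predicted, target))
    return correct / len(target)

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: center and scale rewards within one rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# One puzzle, four sampled answers (tile permutations) from the policy.
target = [2, 0, 3, 1]
rollouts = [[2, 0, 3, 1], [2, 0, 1, 3], [0, 1, 2, 3], [2, 3, 0, 1]]

rewards = [jigsaw_reward(r, target) for r in rollouts]
advantages = group_relative_advantages(rewards)
print(rewards)
print(advantages)
```

The ground-truth permutation is known because the puzzle is generated from the image itself, so no labels are needed. Rollouts that beat the group mean get a positive advantage, and those below it get a negative one.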
✨DeFM: Learning Foundation Representations from Depth for Robotics
📝 Summary:
DeFM is a self-supervised foundation model for depth representation learning in robotics. It learns geometric and semantic features from 60M depth images, achieving state-of-the-art performance across diverse robotic tasks and strong sim-to-real generalization.
🔹 Publication Date: Published on Jan 26
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.18923
• PDF: https://arxiv.org/pdf/2601.18923
• Github: https://de-fm.github.io/
🔹 Models citing this paper:
• https://huggingface.co/leggedrobotics/defm
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#Robotics #FoundationModels #SelfSupervisedLearning #ComputerVision #MachineLearning
✨OmniRad: A Radiological Foundation Model for Multi-Task Medical Image Analysis
📝 Summary:
OmniRad is a self-supervised radiological foundation model pretrained on 1.2 million medical images. It improves classification F1 by 2.05 percent and achieves better segmentation through representation reuse and cross-task transferability.
🔹 Publication Date: Published on Feb 4
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.04547
• PDF: https://arxiv.org/pdf/2602.04547
• Github: https://github.com/unica-visual-intelligence-lab/OmniRad
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#MedicalAI #FoundationModels #Radiology #SelfSupervisedLearning #MedicalImaging
✨OmniVideo-R1: Reinforcing Audio-visual Reasoning with Query Intention and Modality Attention
📝 Summary:
OmniVideo-R1 is a reinforced framework that enhances audio-visual understanding. It uses self-supervised query-intensive grounding and contrastive modality-attentive fusion. Experiments show OmniVideo-R1 consistently outperforms baselines, demonstrating its effectiveness.
🔹 Publication Date: Published on Feb 5
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.05847
• PDF: https://arxiv.org/pdf/2602.05847
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#AudioVisualAI #SelfSupervisedLearning #DeepLearning #MultimodalAI #AIResearch
✨Learning Cross-View Object Correspondence via Cycle-Consistent Mask Prediction
📝 Summary:
This paper presents a conditional binary segmentation framework for robust cross-view object correspondence. It uses cycle-consistency training to create view-invariant representations without ground-truth annotations. This approach achieves state-of-the-art performance on relevant benchmarks.
🔹 Publication Date: Published on Feb 22
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.18996
• PDF: https://arxiv.org/pdf/2602.18996
• Github: https://github.com/shannany0606/CCMP
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#ComputerVision #MachineLearning #ObjectCorrespondence #ImageSegmentation #SelfSupervisedLearning
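The cycle-consistency idea in the summary above can be illustrated with a toy sketch: map a mask from view A to view B, map it back, and penalize disagreement with the original mask. The set-of-pixels mask format, the `a_to_b` / `b_to_a` predictors, and the 1 - IoU loss are assumptions made for illustration, not the paper's model or loss.

```python
def iou(a, b):
    """Intersection over union of two masks given as sets of (row, col) pixels."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def cycle_consistency_loss(mask_a, a_to_b, b_to_a):
    """1 - IoU between a mask and its A -> B -> A round trip."""
    mask_b = a_to_b(mask_a)        # predicted mask in the other view
    mask_a_cycle = b_to_a(mask_b)  # mapped back to the source view
    return 1.0 - iou(mask_a, mask_a_cycle)

# Toy "views": view B is view A shifted two columns to the right.
mask_a = {(0, 0), (0, 1), (1, 0), (1, 1)}
shift_right_2 = lambda m: {(r, c + 2) for r, c in m}
shift_left_2 = lambda m: {(r, c - 2) for r, c in m}
shift_left_1 = lambda m: {(r, c - 1) for r, c in m}  # imperfect inverse

perfect = cycle_consistency_loss(mask_a, shift_right_2, shift_left_2)
imperfect = cycle_consistency_loss(mask_a, shift_right_2, shift_left_1)
print(perfect, imperfect)
```

No ground-truth mask in view B is needed: the supervision signal comes only from returning to the starting mask.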
✨Latent Particle World Models: Self-supervised Object-centric Stochastic Dynamics Modeling
📝 Summary:
LPWM is a self-supervised object-centric world model that autonomously discovers object representations from video data. It models stochastic particle dynamics for decision-making, achieving state-of-the-art results.
🔹 Publication Date: Published on Mar 4
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.04553
• PDF: https://arxiv.org/pdf/2603.04553
• Project Page: https://taldatech.github.io/lpwm-web/
• Github: https://github.com/taldatech/lpwm
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#WorldModels #SelfSupervisedLearning #ObjectCentricAI #MachineLearning #AI
✨A Mixed Diet Makes DINO An Omnivorous Vision Encoder
📝 Summary:
The Omnivorous Vision Encoder learns modality-agnostic features by aligning multi-modal scene inputs and distilling semantics from a frozen teacher model. This resolves poor cross-modal alignment in existing encoders, yielding consistent, powerful embeddings for various modalities.
🔹 Publication Date: Published on Feb 27
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2602.24181
• PDF: https://arxiv.org/pdf/2602.24181
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#MultimodalAI #ComputerVision #DeepLearning #SelfSupervisedLearning #AIResearch
✨V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning
📝 Summary:
V-JEPA 2.1 is a self-supervised model learning dense visual representations for images and videos. It combines dense predictive loss, deep self-supervision, multi-modal tokenizers, and scaling to achieve state-of-the-art performance across various benchmarks, significantly advancing visual unders...
🔹 Publication Date: Published on Mar 15
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.14482
• PDF: https://arxiv.org/pdf/2603.14482
• Project Page: https://ai.meta.com/blog/v-jepa-2-world-model-benchmarks/
• Github: https://github.com/facebookresearch/vjepa2
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#SelfSupervisedLearning #ComputerVision #DeepLearning #AI #VideoUnderstanding
✨V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
📝 Summary:
V-JEPA 2 uses self-supervised learning on web videos and minimal robot data. It excels at video understanding, anticipation, Q&A, and zero-shot robotic planning. This approach yields a powerful world model for physical world planning.
🔹 Publication Date: Published on Jun 11, 2025
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2506.09985
• Explainer: https://arxivexplained.com/papers/v-jepa-2-self-supervised-video-models-enable-understanding-prediction-and-planning
• PDF: https://arxiv.org/pdf/2506.09985
• Github: https://github.com/facebookresearch/vjepa2
✨ Datasets citing this paper:
• https://huggingface.co/datasets/ckadirt/vjxla
✨ Spaces citing this paper:
• https://huggingface.co/spaces/vselvarajijay/vjepa2-latent-prediction
• https://huggingface.co/spaces/aavi21458/vjepa2-latent-prediction
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#AI #SelfSupervisedLearning #VideoAI #Robotics #WorldModels
✨Ghost-FWL: A Large-Scale Full-Waveform LiDAR Dataset for Ghost Detection and Removal
📝 Summary:
This paper introduces Ghost-FWL, the first large-scale full-waveform LiDAR dataset for ghost point detection and removal. It leverages FWL data and a self-supervised learning approach to significantly improve LiDAR-based SLAM and 3D object detection accuracy by effectively removing false reflecti...
🔹 Publication Date: Published on Mar 30
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.28224
• PDF: https://arxiv.org/pdf/2603.28224
• Project Page: https://keio-csg.github.io/Ghost-FWL/
• Github: https://github.com/Keio-CSG/Ghost-FWL
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#LiDAR #GhostDetection #SLAM #3DObjectDetection #SelfSupervisedLearning
🔥 Self-Supervised Prompt Optimization
📅 Published on Feb 7, 2025
🔗 Links:
• arXiv: https://arxiv.org/abs/2502.06855
• PDF: https://arxiv.org/pdf/2502.06855
• GitHub: https://github.com/geekan/metagpt ⭐ 67.7k
🚀 Spaces citing this paper:
• https://huggingface.co/spaces/XiangJinYu/SPO
• https://huggingface.co/spaces/tang-x/SPO
• https://huggingface.co/spaces/ositamiles/SPO
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#SelfSupervisedLearning #PromptOptimization #LargeLanguageModels #NaturalLanguageProcessing #LanguageModelEvaluation
💡 The paper proposes Self-Supervised Prompt Optimization (SPO), a framework that optimizes prompts for large language models without external references. Manually designed prompts require expertise and iterative experimentation, while existing prompt optimization methods rely heavily on external references such as ground truth or human evaluation, which can be costly to obtain. SPO derives its evaluation and optimization signals purely from output comparisons: an LLM evaluator selects the superior prompt through pairwise output comparisons, and an LLM optimizer aligns outputs with task requirements. According to the paper, SPO achieves results comparable or superior to state-of-the-art prompt optimization methods at significantly lower cost and with fewer samples. It handles both closed and open-ended tasks and applies to real-world scenarios where external references are unavailable or costly to obtain.
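The optimize / execute / compare cycle described above can be sketched as a short loop. This is a minimal sketch under stated assumptions: `spo_loop` and its three callables (`execute`, `optimize`, `prefers_b`) are hypothetical names, and the stand-ins below are deterministic toys so the loop runs without any model. In practice each would be an LLM call.

```python
from typing import Callable, List

def spo_loop(
    initial_prompt: str,
    questions: List[str],
    execute: Callable[[str, str], str],
    optimize: Callable[[str, List[str]], str],
    prefers_b: Callable[[List[str], List[str]], bool],
    iterations: int = 3,
) -> str:
    """Keep the prompt whose outputs win the pairwise comparison."""
    best_prompt = initial_prompt
    best_outputs = [execute(best_prompt, q) for q in questions]
    for _ in range(iterations):
        # Propose a new prompt from the current best prompt and its outputs.
        candidate = optimize(best_prompt, best_outputs)
        candidate_outputs = [execute(candidate, q) for q in questions]
        # No ground truth: the only signal is which output set is preferred.
        if prefers_b(best_outputs, candidate_outputs):
            best_prompt, best_outputs = candidate, candidate_outputs
    return best_prompt

# Toy stand-ins (assumptions, not real model behavior).
def toy_execute(prompt: str, question: str) -> str:
    return f"{prompt} | {question}"

def toy_optimize(prompt: str, outputs: List[str]) -> str:
    suffix = " Think step by step."
    return prompt if prompt.endswith(suffix) else prompt + suffix

def toy_prefers_b(a: List[str], b: List[str]) -> bool:
    return sum(map(len, b)) > sum(map(len, a))

result = spo_loop("Answer the question.", ["What is 2 + 2?"],
                  toy_execute, toy_optimize, toy_prefers_b)
print(result)
```

The loop never sees a reference answer, so it can run on tasks where none exists. Its quality depends entirely on how reliable the pairwise evaluator is.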
🔥 Next-Latent Prediction Transformers Learn Compact World Models
📅 Published on Nov 8, 2025
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2511.05963
• PDF: https://arxiv.org/pdf/2511.05963
• Project Page: https://jaydenteoh.github.io/blog/2026/nextlat
📊 Datasets citing this paper:
• https://huggingface.co/datasets/JaydenTeoh/manhattan
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#NextLatentPrediction #TransformerArchitectures #SelfSupervisedLearning #LatentStatePrediction #CompactWorldModels
💡 The paper introduces Next-Latent Prediction, a method that adds self-supervised latent state prediction to standard next-token training. Standard transformers have no incentive to compress history into compact latent states, which leads to poor generalization. Next-Latent Prediction trains the transformer to learn latent representations that can predict the next latent state given the next output token. This injects a recurrent inductive bias, encouraging the model to form compact internal world models with their own belief states and transition dynamics. The method is simple and efficient, and it leaves the architecture, parallel training, and inference unchanged. The authors report significant gains in downstream accuracy, representation compression, and lookahead planning across benchmarks in world modeling, reasoning, planning, and language modeling, and present Next-Latent Prediction as an effective paradigm for shaping transformer representations toward stronger generalization.
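As a rough illustration of the objective described above, the sketch below adds a latent-prediction term to a next-token loss. The linear predictor (matrices `A` and `B`), the mean-squared error, and the weighting factor are all assumptions chosen for a small runnable example, and the paper's actual predictor and loss may differ. Whether gradients flow through the target latent is also left open here.

```python
def matvec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def next_latent_loss(latents, next_token_embs, A, B):
    """MSE between predicted and actual next latents.

    latents[t] is the latent state h_t; next_token_embs[t] embeds token x_{t+1}.
    Hypothetical linear predictor: h_hat_{t+1} = A h_t + B e(x_{t+1}).
    """
    total, count = 0.0, 0
    for t in range(len(latents) - 1):
        pred = [a + b for a, b in zip(matvec(A, latents[t]),
                                      matvec(B, next_token_embs[t]))]
        target = latents[t + 1]
        total += sum((p - y) ** 2 for p, y in zip(pred, target))
        count += len(target)
    return total / count

def combined_loss(next_token_ce, latent_loss, weight=0.5):
    """Standard next-token loss plus a weighted latent-prediction term."""
    return next_token_ce + weight * latent_loss

identity = [[1.0, 0.0], [0.0, 1.0]]
latents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
next_token_embs = [[0.0, 1.0], [1.0, 0.0]]

latent_loss = next_latent_loss(latents, next_token_embs, identity, identity)
total = combined_loss(next_token_ce=2.0, latent_loss=latent_loss)
print(latent_loss, total)
```

Because the extra term only adds a loss on existing hidden states, the forward pass used at inference is untouched, which matches the summary's note that architecture and inference are unchanged.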