AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
Structured Causal Video Reasoning via Multi-Objective Alignment

📝 Summary:
This paper introduces Structured Event Facts for explicit causal video reasoning, moving beyond unstructured methods. It uses a multi-objective reinforcement learning pipeline to balance training goals, leading to Factum-4B. This model achieves reliable, stronger performance on complex temporal v...

🔹 Publication Date: Published on Apr 6

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.04415
• PDF: https://arxiv.org/pdf/2604.04415

==================================

For more data science resources:
https://xn--r1a.website/DataScienceT

#CausalAI #VideoReasoning #ReinforcementLearning #ComputerVision #AIResearch
3DTV: A Feedforward Interpolation Network for Real-Time View Synthesis

📝 Summary:
3DTV is a feedforward network combining lightweight geometry and learning for real-time, robust sparse-view interpolation. It generates novel views efficiently without scene-specific optimization, making it practical for interactive applications.

🔹 Publication Date: Published on Apr 13

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.11211
• PDF: https://arxiv.org/pdf/2604.11211
• Project Page: https://stefanmschulz.github.io/3DTV_webpage/
• Github: https://github.com/StefanMSchulz/3DTV

#ViewSynthesis #DeepLearning #ComputerVision #NeuralNetworks #RealTimeAI
ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video

📝 Summary:
ReconPhys is the first feedforward framework to jointly learn physical attribute estimation and 3D Gaussian Splatting reconstruction from a single video. It offers significantly faster inference and superior reconstruction quality for non-rigid objects compared to prior optimization-based methods...

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.07882
• PDF: https://arxiv.org/pdf/2604.07882
• Project Page: https://chuanshuogushi.github.io/ReconPhys/
• Github: https://chuanshuogushi.github.io/ReconPhys/

#ComputerVision #3DReconstruction #GaussianSplatting #DeepLearning #AIResearch
VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

📝 Summary:
VEFX-Bench offers a large human-annotated video editing dataset and VEFX-Reward, a specialized model for quality assessment. This benchmark allows standardized comparison, showing current models struggle with instruction following and edit locality.

🔹 Publication Date: Published on Apr 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.16272
• PDF: https://arxiv.org/pdf/2604.16272
• Project Page: https://xiangbogaobarry.github.io/VEFX-Bench/

#VideoEditing #VFX #AI #ComputerVision #Benchmarks
NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results

📝 Summary:
This paper overviews the NTIRE 2026 Challenge on Video Saliency Prediction. Participants developed automatic saliency map prediction for videos using a novel 2,000-video dataset with crowdsourced fixations. Over 20 teams submitted, and all challenge data is now publicly available.

🔹 Publication Date: Published on Apr 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.14816
• PDF: https://arxiv.org/pdf/2604.14816
• Project Page: https://www.codabench.org/competitions/12842/
• Github: https://github.com/msu-video-group/NTIRE26_Saliency_Prediction

#VideoSaliency #ComputerVision #NTIRE #MachineLearning #SaliencyPrediction
Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

📝 Summary:
This paper improves vision-language models for compositional reasoning by using concreteness-based negative sample selection and a novel margin-based loss. Their framework, Slipform, achieves state-of-the-art accuracy on compositional benchmarks and cross-modal retrieval.
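
As a rough illustration of the margin-based idea mentioned above, the sketch below computes a hinge loss that asks a matching image-caption pair to score higher than each image-negative pair by a fixed margin. The helper names, the cosine similarity, and the margin value are assumptions made for illustration; this is not the paper's Slipform loss, and the concreteness-based negative selection is not reproduced.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def margin_loss(image, positive, negatives, margin=0.2):
    """Mean hinge loss over negatives (hypothetical helper, not the paper's loss)."""
    pos = cosine(image, positive)
    losses = [max(0.0, margin - pos + cosine(image, neg)) for neg in negatives]
    return sum(losses) / len(losses)

image = [1.0, 0.0]
caption = [1.0, 0.0]        # matching caption embedding
hard_negative = [0.0, 1.0]  # e.g. a caption with one word swapped
print(margin_loss(image, caption, [hard_negative]))
```

The loss is zero once every negative trails the positive by at least the margin, so only negatives that are still too close contribute to training.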

🔹 Publication Date: Published on Apr 14

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.13313
• PDF: https://arxiv.org/pdf/2604.13313

#VisionLanguage #DeepLearning #AIResearch #ComputerVision #NLP
CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

📝 Summary:
CityRAG generates long-term, physically grounded video sequences that maintain environmental consistency and support complex navigation through real-world geography using geo-registered data as contex...

🔹 Publication Date: Published on Apr 21

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.19741
• PDF: https://arxiv.org/pdf/2604.19741
• Project Page: https://cityrag.github.io/

#VideoGeneration #GenerativeAI #SpatialAI #ComputerVision #UrbanSimulation
DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

📝 Summary:
DeVI enables physically plausible dexterous robot control by leveraging text-conditioned synthetic videos through a hybrid tracking reward that combines 3D and 2D tracking for improved hand-object int...

🔹 Publication Date: Published on Apr 22

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.20841
• PDF: https://arxiv.org/pdf/2604.20841
• Project Page: https://snuvclab.github.io/devi/
• Github: https://github.com/snuvclab/devi

#Robotics #AI #ComputerVision #HumanRobotInteraction #DeepLearning
3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

📝 Summary:
3D-VCD is a new inference-time framework that reduces hallucinations in 3D embodied agents. It constructs distorted 3D scene graphs and contrasts predictions to suppress ungrounded tokens. This improves reasoning on 3D benchmarks without retraining.
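
The contrast step can be sketched as follows, assuming the form commonly used in visual contrastive decoding: token log-probabilities conditioned on the real scene graph are amplified, and those conditioned on the distorted graph are subtracted. The function names, the `alpha` weight, and the toy logits are assumptions; the summary does not state 3D-VCD's exact formula.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def contrastive_scores(clean_logits, distorted_logits, alpha=1.0):
    """(1 + alpha) * log p_clean - alpha * log p_distorted, per token."""
    clean = log_softmax(clean_logits)
    distorted = log_softmax(distorted_logits)
    return [(1 + alpha) * c - alpha * d for c, d in zip(clean, distorted)]

def argmax(values):
    return max(range(len(values)), key=values.__getitem__)

vocab = ["chair", "table", "lamp"]
clean = [1.9, 2.0, 0.0]      # toy logits given the real scene graph
distorted = [0.5, 2.0, 0.0]  # toy logits given the distorted scene graph

greedy = vocab[argmax(clean)]
contrasted = vocab[argmax(contrastive_scores(clean, distorted))]
print(greedy, contrasted)
```

In this toy case "table" stays likely even when the scene is distorted, which suggests it comes from a language prior rather than the scene, so the contrast pushes the choice to "chair".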

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08645
• PDF: https://arxiv.org/pdf/2604.08645
• Project Page: https://plan-lab.github.io/projects/3d-vcd

#3DLLM #EmbodiedAI #HallucinationMitigation #ComputerVision #AIResearch
FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing

📝 Summary:
FlowAnchor stabilizes inversion-free video editing by addressing signal instability in high-dimensional latent spaces. It uses spatial-aware attention refinement and adaptive magnitude modulation to ensure precise localization and sufficient editing strength, leading to faithful and coherent vide...

🔹 Publication Date: Published on Apr 24

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.22586
• PDF: https://arxiv.org/pdf/2604.22586

#VideoEditing #DeepLearning #ComputerVision #GenerativeAI #AIResearch
Video Analysis and Generation via a Semantic Progress Function

📝 Summary:
Researchers developed a Semantic Progress Function to analyze and correct non-linear semantic evolution in generated media. This function identifies uneven pacing, enabling a linearization procedure that re-times sequences for smoother, more coherent transitions at a constant semantic rate.
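
One way to picture the linearization: treat progress as the normalized cumulative distance between consecutive frame embeddings, then pick fractional source-frame positions whose progress values are evenly spaced. The sketch below does this under stated assumptions: Euclidean distance on precomputed embeddings, at least two output frames, and a sequence that is not entirely static. The function names are hypothetical and the paper's actual definition may differ.

```python
import math

def progress(embeddings):
    """Normalized cumulative distance between consecutive frame embeddings."""
    cum = [0.0]
    for prev, cur in zip(embeddings, embeddings[1:]):
        cum.append(cum[-1] + math.dist(prev, cur))
    total = cum[-1]  # assumed non-zero (the sequence changes at least once)
    return [c / total for c in cum]

def linearize(prog, num_out):
    """Fractional source-frame positions that give a constant progress rate."""
    positions = []
    j = 0
    for k in range(num_out):
        target = k / (num_out - 1)
        # Advance to the segment [prog[j], prog[j + 1]] containing the target.
        while j < len(prog) - 2 and prog[j + 1] < target:
            j += 1
        span = prog[j + 1] - prog[j]
        frac = 0.0 if span == 0 else (target - prog[j]) / span
        positions.append(j + frac)
    return positions

frames = [[0.0], [1.0], [2.0], [4.0]]  # toy 1-D embeddings; the last step is twice as large
prog = progress(frames)
print(prog)                # [0.0, 0.25, 0.5, 1.0]
print(linearize(prog, 5))  # [0.0, 1.0, 2.0, 2.5, 3.0]
```

The position 2.5 shows the re-timing: the large jump between the last two frames gets an extra in-between sample so that progress advances evenly.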

🔹 Publication Date: Published on Apr 24

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.22554
• PDF: https://arxiv.org/pdf/2604.22554
• Project Page: https://sagipolaczek.github.io/semantic-progress-function/
• Github: https://github.com/SagiPolaczek/semantic-progress-function

#VideoAI #GenerativeAI #ComputerVision #SemanticAnalysis #AIResearch
SketchVLM: Vision language models can annotate images to explain thoughts and guide users

📝 Summary:
SketchVLM is a training-free framework that enables vision-language models to generate editable SVG overlays for visual explanations, improving reasoning accuracy and annotation quality across multipl...

🔹 Publication Date: Published on Apr 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.22875
• PDF: https://arxiv.org/pdf/2604.22875
• Project Page: https://sketchvlm.github.io/

#SketchVLM #VisionLanguageModels #ComputerVision #AI #ImageAnnotation
Sapiens2

📝 Summary:
Sapiens2 is a high-resolution transformer model for human-centric vision. It achieves state-of-the-art performance by combining unified pretraining objectives, a large 1-billion image dataset, and architectural improvements, excelling in tasks like pose and segmentation.

🔹 Publication Date: Published on Apr 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.21681
• PDF: https://arxiv.org/pdf/2604.21681
• Github: https://github.com/facebookresearch/sapiens2

🔹 Models citing this paper:
https://huggingface.co/facebook/sapiens2
https://huggingface.co/facebook/sapiens2-seg-5b
https://huggingface.co/facebook/sapiens2-seg-1b

Spaces citing this paper:
https://huggingface.co/spaces/facebook/sapiens2-seg
https://huggingface.co/spaces/facebook/sapiens2-pointmap
https://huggingface.co/spaces/facebook/sapiens2-normal

#Sapiens2 #ComputerVision #TransformerModels #HumanCentricAI #DeepLearning
Probing Visual Planning in Image Editing Models

📝 Summary:
This paper redefines visual planning as a single-step image transformation using abstract puzzles for evaluation. Their EAR paradigm and AMAZE dataset reveal that current AI models, despite finetuning, cannot match human zero-shot efficiency, highlighting a gap in visual reasoning.

🔹 Publication Date: Published on Apr 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.22868
• PDF: https://arxiv.org/pdf/2604.22868
• Project Page: https://spatigen.github.io/amaze.io/
• Github: https://github.com/spatigen/amaze

#VisualPlanning #ImageEditing #ComputerVision #AIResearch #MachineLearning
🔥 SAM 3: Segment Anything with Concepts

💡 The paper introduces Segment Anything Model 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts. The model achieves state-of-the-art performance in promptable concept segmentation and tracking by leveraging a unified model architecture with decoupled recognition and localization. The concept prompts can be short noun phrases, image exemplars, or a combination of both, and the model returns segmentation masks and unique identities for all matching object instances.

To advance promptable concept segmentation, the authors built a scalable data engine that produces a high-quality dataset with 4 million unique concept labels, including hard negatives, across images and videos. The model consists of an image-level detector and a memory-based video tracker that share a single backbone. The recognition and localization are decoupled with a presence head, which boosts detection accuracy.
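
A minimal sketch of how a presence head might decouple recognition from localization: one image-level score answers "is the concept present at all?", and it gates the per-query scores that answer "where is it?". Combining the two by multiplication, along with the function names and logits, is an assumption for illustration; the summary only says the two are decoupled.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def detection_scores(presence_logit, query_logits):
    """Gate per-query localization scores with one image-level presence score."""
    presence = sigmoid(presence_logit)
    return [presence * sigmoid(q) for q in query_logits]

query_logits = [3.0, -1.0]  # toy per-query confidence logits

absent = detection_scores(-4.0, query_logits)   # concept judged absent
present = detection_scores(4.0, query_logits)   # concept judged present
print(absent, present)
```

With this gating, a confident box proposal is still suppressed when the presence score says the concept is not in the image, which is one plausible way hard negatives get filtered.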

The results show that Segment Anything Model 3 doubles the accuracy of existing systems in both image and video promptable concept segmentation, and improves on earlier SAM capabilities in visual segmentation tasks. The authors also open-source the model along with a new benchmark for promptable concept segmentation, called Segment Anything with Concepts.

The main contributions are a unified architecture for promptable concept segmentation and tracking, a large-scale dataset of unique concept labels, and a new benchmark for evaluating promptable concept segmentation models.


📅 Published on Nov 20, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2511.16719
• PDF: https://arxiv.org/pdf/2511.16719
• Project Page: https://ai.meta.com/sam3/

🤖 Models citing this paper:
https://huggingface.co/AllanVester/SAM3.1-CoreML-FP16
https://huggingface.co/AllanVester/SAM3.1-CoreML
https://huggingface.co/embedl/sam3

🚀 Spaces citing this paper:
https://huggingface.co/spaces/kith777/rag_agent

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ComputerVision #ObjectSegmentation #ConceptLearning #ImageTracking #PromptableSegmentation
🔥 PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

💡 The paper introduces PhysX-Omni, a unified framework for generating simulation-ready 3D assets with physical properties across multiple categories. The problem addressed is that existing 3D generation methods either neglect physical properties or are limited to a single asset category, such as rigid, deformable, or articulated objects. To address this, the authors develop a novel geometry representation tailored for vision-language models, which directly encodes high-resolution 3D structures without compression, significantly improving generation performance.

The PhysX-Omni framework generates simulation-ready physical 3D assets using this novel geometry representation. The authors also construct the first general simulation-ready 3D dataset, PhysXVerse, covering diverse indoor and outdoor categories. To evaluate the framework, they propose PhysX-Bench, a benchmark that encompasses six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description.

The results show that PhysX-Omni performs strongly in both generation and understanding, as measured by conventional metrics and by PhysX-Bench. Additional studies validate the potential of PhysX-Omni for applications in simulation-ready scene generation and robotic policy learning. The authors believe that PhysX-Omni can significantly advance a wide range of downstream applications, particularly in embodied AI and physics-based simulation.

The key contributions are the novel geometry representation, the PhysXVerse dataset, and the PhysX-Bench benchmark. Together they enable generation of simulation-ready physical 3D assets across rigid, deformable, and articulated categories, for use in robotics, computer vision, and simulation.


📅 Published on May 20

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.21572
• PDF: https://arxiv.org/pdf/2605.21572
• Project Page: https://physx-omni.github.io

#ComputerVision #3DModeling #PhysicsBasedSimulation #ArticulatedObjectSimulation #DeformableObjectModeling
🔥 GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

💡 The paper proposes a self-evolving image generation framework called GenEvolve that improves generative capabilities through iterative learning and reference-based prompting. The problem addressed is that high-quality image generation often requires combining a model's internal generative ability with external resources, and existing methods have limitations in handling diverse and demanding requests.

The GenEvolve framework models each generation attempt as a tool-orchestrated trajectory, where the agent gathers evidence, selects references, invokes generation skills, and composes them into a prompt-reference program. Unlike existing methods that rely on image-level scalar rewards, GenEvolve compares multiple trajectories for the same request and abstracts best-worst differences into structured visual experience.
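
To make the best-worst comparison concrete, here is a toy sketch that picks the highest- and lowest-scoring trajectories for one request and records which steps appear in only one of them. The data layout, step names, and the `keep`/`avoid` labels are hypothetical; the paper's structured visual experience is richer than a set difference.

```python
def best_worst_diff(trajectories):
    """trajectories: list of (score, steps) pairs for the same request."""
    best = max(trajectories, key=lambda t: t[0])
    worst = min(trajectories, key=lambda t: t[0])
    return {
        "keep": [s for s in best[1] if s not in worst[1]],
        "avoid": [s for s in worst[1] if s not in best[1]],
    }

trajectories = [
    (0.9, ["search", "pick_reference", "compose_prompt"]),
    (0.6, ["search", "compose_prompt"]),
    (0.3, ["compose_prompt", "skip_search"]),
]
experience = best_worst_diff(trajectories)
print(experience)
```

Comparing trajectories against each other, rather than scoring each image in isolation, is what lets the difference be expressed in terms of concrete steps.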

This visual experience is provided to a privileged teacher branch, which uses visual experience distillation to provide dense token-level supervision to a student branch. This helps the student internalize better search, knowledge activation, reference selection, and prompt construction. The authors also construct GenEvolve-Data and GenEvolve-Bench to evaluate the framework.

The results show that GenEvolve achieves substantial gains over strong baselines and state-of-the-art performance among current image-generation frameworks, both on public benchmarks and on GenEvolve-Bench. Overall, the paper contributes a self-evolving image generation framework aimed at diverse and demanding generation requests.


📅 Published on May 20

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.21605
• PDF: https://arxiv.org/pdf/2605.21605
• Project Page: https://ephemeral182.github.io/GenEvolve/

🤖 Models citing this paper:
https://huggingface.co/MeiGen-AI/GenEvolve

📊 Datasets citing this paper:
https://huggingface.co/datasets/MeiGen-AI/GenEvolve-Data-Bench

#ComputerVision #ImageGeneration #GenerativeModels #SelfEvolvingSystems #DeepLearning
🔥 TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

💡 The paper presents TriSplat, a feed-forward 3D reconstruction network that generates simulation-ready meshes from single images. The problem addressed is that existing methods for 3D reconstruction require expensive post-processing steps to extract a usable mesh for simulation or physics reasoning. Most existing methods use Gaussian primitives and do not directly expose surfaces, making it difficult to obtain a simulation-ready mesh.

The method proposed in the paper uses oriented triangle primitives to represent scenes and directly exports simulation-ready mesh scenes from a single forward pass. The network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics from input images. The approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization.
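
The first geometric step, deriving normals from a point map, can be sketched with finite differences: take the vectors to the right and lower neighbors of each pixel's 3D point and normalize their cross product. The grid layout, function names, and forward-difference scheme are assumptions, degenerate (zero-area) neighborhoods are not handled, and the learned refinement head is not modeled.

```python
import math

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)

def normals_from_point_map(points):
    """points[r][c] is an (x, y, z) tuple; returns an (H-1) x (W-1) grid of unit normals."""
    h, w = len(points), len(points[0])
    normals = []
    for r in range(h - 1):
        row = []
        for c in range(w - 1):
            p = points[r][c]
            du = tuple(a - b for a, b in zip(points[r][c + 1], p))  # step along the row
            dv = tuple(a - b for a, b in zip(points[r + 1][c], p))  # step along the column
            row.append(normalize(cross(du, dv)))
        normals.append(row)
    return normals

# A flat patch at depth z = 1: every normal should point along +z.
plane = [[(float(c), float(r), 1.0) for c in range(3)] for r in range(3)]
print(normals_from_point_map(plane)[0][0])  # (0.0, 0.0, 1.0)
```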

The results show that the proposed representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. The output of the network can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction. The experiments were conducted on RealEstate10K and DL3DV datasets and demonstrate the effectiveness of the proposed approach. Overall, the paper contributes a novel method for 3D scene reconstruction that bypasses expensive post-processing steps and directly generates simulation-ready meshes from single images.


📅 Published on May 25

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.26115
• PDF: https://arxiv.org/pdf/2605.26115
• Project Page: https://lhmd.top/trisplat/#interactive

🤖 Models citing this paper:
https://huggingface.co/lhmd/TriSplat

📊 Datasets citing this paper:
https://huggingface.co/datasets/lhmd/re10k_torch

#3DSceneReconstruction #SimulationReadyMeshes #FeedForwardNetworks #TrianglePrimitives #ComputerVision
🔥 A Very Big Video Reasoning Suite

💡 The paper introduces a large-scale video reasoning dataset and benchmark to study video intelligence capabilities beyond visual quality. The problem addressed is that current video models have focused on visual quality, leaving their reasoning capabilities underexplored. Video reasoning involves understanding spatiotemporal structure such as continuity, interaction, and causality, which is essential for intelligent systems. However, the lack of large-scale training data has hindered systematic study of video reasoning.

To address this gap, the authors introduce the Very Big Video Reasoning Dataset, a resource consisting of 200 curated reasoning tasks and over one million video clips, approximately three orders of magnitude larger than existing datasets. The authors also present VBVR-Bench, a verifiable evaluation framework that incorporates rule-based, human-aligned scorers to enable reproducible and interpretable diagnosis of video reasoning capabilities.

The results show early signs of emergent generalization to unseen reasoning tasks, indicating that the dataset and benchmark can support the development of more generalizable video reasoning models. The dataset, benchmark toolkit, and models are publicly available, laying a foundation for the next stage of research in generalizable video reasoning.


📅 Published on Feb 23

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2602.20159
• PDF: https://arxiv.org/pdf/2602.20159
• Project Page: https://video-reason.com/

🤖 Models citing this paper:
https://huggingface.co/Video-Reason/VBVR-Wan2.2
https://huggingface.co/Video-Reason/VBVR-LTX2.3-diffsynth
https://huggingface.co/Video-Reason/VBVR-Wan2.1-diffsynth

📊 Datasets citing this paper:
https://huggingface.co/datasets/Video-Reason/VBVR-Dataset
https://huggingface.co/datasets/Video-Reason/VBVR-Bench-Data
https://huggingface.co/datasets/Video-Reason/video-mcp

🚀 Spaces citing this paper:
https://huggingface.co/spaces/Video-Reason/VBVR-Bench-Leaderboard

#VideoIntelligence #VideoReasoning #SpatiotemporalAnalysis #CausalityInAI #ComputerVision
🔥 SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

💡 The paper presents SCAIL-2, a framework for controlled character animation that enables end-to-end motion transfer from driving videos to reference characters without using intermediate representations. Prior methods relied on intermediate representations such as pose skeletons or masked backgrounds, which led to information loss. SCAIL-2 addresses this issue by directly concatenating driving videos to the sequence, allowing the model to obtain all required visual information from the input video.

To overcome the lack of end-to-end data, the authors unify sub-tasks of character animation with decoupled conditions and create a pipeline to synthesize a large dataset called MotionPair-60K, which contains heterogeneous tasks of character animation. The framework utilizes in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information.

The authors also propose Bias-Aware DPO, which constructs preference items to mitigate errors caused by synthetic discrepancies in detailed regions. Extensive experiments demonstrate that SCAIL-2 substantially outperforms existing state-of-the-art approaches in various character animation tasks.
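
For readers unfamiliar with DPO, the sketch below shows the standard DPO loss for a single preference pair, computed from sequence log-probabilities under the trained policy and a frozen reference model. This is the plain formulation only; how Bias-Aware DPO builds its preference items or modifies the objective is not described in the summary and is not modeled here. The argument names and the `beta` value are assumptions.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair of sequence log-probabilities."""
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    # log(1 + exp(-x)) equals -log(sigmoid(x))
    return math.log1p(math.exp(-beta * margin))

# The policy matches the reference: the loss is log(2).
neutral = dpo_loss(-10.0, -10.0, -10.0, -10.0)
# The policy has moved toward the chosen sample and away from the rejected one.
improved = dpo_loss(-8.0, -12.0, -10.0, -10.0)
print(neutral, improved)
```

The loss falls as the policy raises the chosen sample's likelihood relative to the reference more than it raises the rejected sample's.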

The key contributions of the paper are the development of an end-to-end character animation framework that bypasses intermediate representations, the creation of a large synthetic dataset for motion transfer, and the proposal of a novel method to address synthetic discrepancies. The results show that SCAIL-2 achieves superior performance compared to existing methods, and the authors plan to release a large subset of synthetic data and model weights to facilitate further research.


📅 Published on Jun 9

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.10804
• PDF: https://arxiv.org/pdf/2606.10804
• Project Page: https://teal024.github.io/SCAIL-2/

#CharacterAnimation #MotionTransfer #EndToEndLearning #InContextConditioning #ComputerVision