AI & ML Papers

🔥 UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

💡 The paper introduces UniVidX, a unified multimodal framework for versatile video generation using video diffusion model priors. The problem with existing methods is that they train separate models for each task, limiting the modeling of correlations across different modalities. UniVidX addresses this issue by formulating pixel-aligned tasks as conditional generation in a shared multimodal space, allowing it to adapt to modality-specific distributions while preserving the native priors of the video diffusion model.

The framework consists of three key designs: Stochastic Condition Masking, Decoupled Gated LoRA, and Cross-Modal Self-Attention. Stochastic Condition Masking enables omni-directional conditional generation by randomly partitioning modalities into clean conditions and noisy targets during training. Decoupled Gated LoRA preserves the strong priors of the video diffusion model by introducing per-modality LoRAs that are activated when a modality serves as the generation target. Cross-Modal Self-Attention facilitates information exchange and inter-modal alignment by sharing keys and values across modalities while keeping modality-specific queries.
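The first two designs can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the modality names, the 0.5 split probability, the rule that at least one modality must be a target, and the helper names `partition_modalities` and `lora_gates` are all hypothetical.

```python
import random


def partition_modalities(modalities, rng, target_prob=0.5):
    """Randomly split modalities into clean conditions and noisy targets.

    Assumption: at least one modality must be a target, otherwise there
    would be nothing to generate, so all-condition draws are resampled.
    """
    while True:
        is_target = {m: rng.random() < target_prob for m in modalities}
        if any(is_target.values()):
            break
    conditions = [m for m in modalities if not is_target[m]]
    targets = [m for m in modalities if is_target[m]]
    return conditions, targets


def lora_gates(modalities, targets):
    """Gate each per-modality LoRA: on only when that modality is generated.

    Condition modalities get gate 0.0, so they would pass through the
    frozen base weights unchanged in this sketch.
    """
    return {m: 1.0 if m in targets else 0.0 for m in modalities}


rng = random.Random(0)
# Hypothetical modality names for an intrinsic-decomposition setup.
modalities = ["rgb", "albedo", "normal", "irradiance"]
conditions, targets = partition_modalities(modalities, rng)
gates = lora_gates(modalities, targets)
print("conditions:", conditions)
print("targets:", targets)
print("gates:", gates)
```

Sampling a fresh partition at every training step is what would let a single model serve any direction at inference, for example RGB to intrinsic maps or the reverse, by fixing the partition by hand.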

The authors instantiate UniVidX in two domains: UniVid-Intrinsic for RGB videos and intrinsic maps, and UniVid-Alpha for blended RGB videos and their constituent RGBA layers. The results show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1000 videos. Overall, UniVidX provides a unified framework for versatile video generation, allowing for more efficient and effective modeling of correlations across different modalities.

📅 Published on May 1

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.00658
• PDF: https://arxiv.org/pdf/2605.00658
• Project Page: https://houyuanchen111.github.io/UniVidX.github.io/
• GitHub: https://github.com/houyuanchen111/UniVidX ⭐ 93

🤖 Models citing this paper:
• https://huggingface.co/houyuanchen/UniVidX

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalVideoGeneration #VideoDiffusionModels #ConditionalGeneration #CrossModalLearning #MultimodalFusionArchitectures

🔥 AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation

💡 The paper introduces AnyFlow, a novel framework for any-step video diffusion distillation that improves upon existing consistency distillation methods. The problem with consistency distillation is that its performance degrades as more sampling steps are used at test time, limiting its effectiveness for any-step video diffusion. This is because consistency distillation replaces the original probability-flow ODE trajectory with a consistency-sampling trajectory, which weakens the desirable test-time scaling behavior of ODE sampling.

To address this limitation, AnyFlow optimizes the full ODE sampling trajectory instead of distilling a model for only a few fixed sampling steps. The method involves shifting the distillation target from endpoint consistency mapping to flow-map transition learning over arbitrary time intervals. Additionally, the authors propose Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions, enabling efficient on-policy distillation that reduces test-time errors.
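To make the idea of a flow-map transition concrete, here is a toy sketch. It assumes a one-dimensional ODE dx/dt = -x, chosen only because its flow map has a closed form. AnyFlow learns the flow map with a network and distills it on-policy, which this example does not attempt. All function names are hypothetical.

```python
import math


def velocity(x, t):
    # Toy ODE dx/dt = -x (an assumption, picked for its closed-form solution).
    return -x


def euler_rollout(x, t0, t1, steps):
    """Integrate the ODE from t0 to t1 with fixed-size Euler steps."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        x = x + dt * velocity(x, t)
        t += dt
    return x


def flow_map(x, t, s):
    """Exact transition from time t to time s for the toy ODE."""
    return x * math.exp(-(s - t))


def flow_map_rollout(x, times):
    """Chain flow-map transitions over an arbitrary time grid."""
    for t, s in zip(times, times[1:]):
        x = flow_map(x, t, s)
    return x


x0 = 1.0
exact = flow_map(x0, 0.0, 1.0)
one_jump = flow_map_rollout(x0, [0.0, 1.0])
uneven = flow_map_rollout(x0, [0.0, 0.1, 0.7, 1.0])
euler_errors = [abs(euler_rollout(x0, 0.0, 1.0, n) - exact) for n in (1, 4, 16, 64)]
print(one_jump, uneven, euler_errors)
```

Chained flow-map transitions land on the same endpoint regardless of how the interval is split, while the Euler error shrinks as steps are added. That is the property an any-step model needs: few steps already work, and more steps do not hurt.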

The results show that AnyFlow matches or surpasses consistency-based counterparts in the few-step regime while continuing to improve as the sampling step budget grows. The experiments cover both bidirectional and causal architectures at scales ranging from 1.3B to 14B parameters.

📅 Published on May 13

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.13724
• PDF: https://arxiv.org/pdf/2605.13724
• Project Page: https://nvlabs.github.io/AnyFlow/
• GitHub: https://github.com/NVlabs/AnyFlow ⭐ 197

🤖 Models citing this paper:
• https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers
• https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers
• https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VideoDiffusionModels #OnPolicyLearning #FlowMapDistillation #AnyStepSampling #DiffusionBasedGenerativeModels

🔥 RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO

💡 The paper introduces RAVEN, a real-time autoregressive video extrapolation network, and CM-GRPO, a consistency model-based reinforcement learning approach. The problem addressed is the gap between the history distributions encountered during training and those arising at inference in causal autoregressive video diffusion models, which constrains generation quality over long horizons.

To solve this problem, RAVEN repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states, aligning training attention with inference-time extrapolation. This formulation allows downstream chunk losses to supervise the history representations on which future predictions depend.

Additionally, CM-GRPO reformulates a consistency sampling step as a conditional Gaussian transition and applies online reinforcement learning directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations.
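A rough sketch of why a consistency sampling step can be treated as a Gaussian policy follows. It assumes a rectified-flow-style re-noising rule, x_next = (1 - t_next) * x0_pred + t_next * eps, which may differ from the paper's parameterization, and it uses the standard group-relative advantage from GRPO. Function names are hypothetical.

```python
import math
import statistics


def transition_mean_std(x0_pred, t_next):
    """Mean and std of x_next given the model's clean prediction.

    Assumed rule: x_next = (1 - t_next) * x0_pred + t_next * eps, eps ~ N(0, I),
    so x_next is Gaussian with mean (1 - t_next) * x0_pred and std t_next.
    """
    return [(1.0 - t_next) * v for v in x0_pred], t_next


def gaussian_log_prob(x_next, mean, sigma):
    """Log-density of x_next under N(mean, sigma^2 I), summed over dimensions."""
    var = sigma * sigma
    return sum(
        -0.5 * (x - m) ** 2 / var - 0.5 * math.log(2.0 * math.pi * var)
        for x, m in zip(x_next, mean)
    )


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of rollouts, as in GRPO."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


x0_pred = [0.2, -0.4]
mean, sigma = transition_mean_std(x0_pred, t_next=0.5)
logp_at_mean = gaussian_log_prob(mean, mean, sigma)
logp_off_mean = gaussian_log_prob([m + 0.3 for m in mean], mean, sigma)
advantages = group_relative_advantages([1.0, 2.0, 3.0, 4.0])
print(logp_at_mean, logp_off_mean, advantages)
```

With an explicit log-probability for each step, a policy-gradient objective can weight it by the advantage directly, which is the part that removes the need for an auxiliary stochastic process.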

The results demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations. Furthermore, CM-GRPO provides further gains when combined with RAVEN, indicating the effectiveness of the proposed methods in improving real-time video generation.

Overall, the paper combines causal autoregressive extrapolation, training attention aligned with inference-time rollouts, and consistency-model-based reinforcement learning to improve real-time video generation.

📅 Published on May 14

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15190
• PDF: https://arxiv.org/pdf/2605.15190
• Project Page: https://yanzuo.lu/raven/

🤖 Models citing this paper:
• https://huggingface.co/mvp-lab/RAVEN

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AutoregressiveVideoExtrapolation #VideoDiffusionModels #ReinforcementLearningForVideo #ConsistencyModelBasedRL #RealTimeVideoGeneration

🔥 SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

💡 The paper introduces SCOPE, a method for simulating cross-game operations in playable environments for first-person shooter (FPS) games. The problem addressed is that existing interactive world models for FPS games struggle to handle high-frequency, overlapping control signals without disrupting unaffected regions. This is because they inject actions globally and are trained on single game titles, an approach that fails under dense FPS inputs.

The proposed method conditions the transformer blocks of a video diffusion model so that in-scope visual effects are separated from out-of-scope ones without requiring segmentation labels. This is achieved by inserting a conditioning module into each transformer block of a pre-trained video diffusion model; the module reshapes features into per-pixel temporal sequences. Each position can then compute its action response from local visual content, which separates in-scope effects from out-of-scope generation.
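The reshaping step can be illustrated with nested lists. This sketch assumes features are laid out as a [T][H][W] grid of vectors, which is a guess at the layout, and it shows only the rearrangement, not the conditioning module itself. The function names are hypothetical.

```python
def to_pixel_sequences(features):
    """Rearrange a [T][H][W] grid of feature vectors into H*W sequences of length T.

    Each output sequence follows one spatial position through time, so a
    per-position module can respond to actions using only local content.
    """
    num_frames = len(features)
    height = len(features[0])
    width = len(features[0][0])
    return [
        [features[t][h][w] for t in range(num_frames)]
        for h in range(height)
        for w in range(width)
    ]


def from_pixel_sequences(sequences, height, width):
    """Inverse of to_pixel_sequences: restore the [T][H][W] layout."""
    num_frames = len(sequences[0])
    return [
        [[sequences[h * width + w][t] for w in range(width)] for h in range(height)]
        for t in range(num_frames)
    ]


# Tag each feature with its (t, h, w) index so the rearrangement is visible.
T, H, W = 3, 2, 2
features = [[[(t, h, w) for w in range(W)] for h in range(H)] for t in range(T)]
sequences = to_pixel_sequences(features)
restored = from_pixel_sequences(sequences, H, W)
print(sequences[1])
```

The second sequence holds position (h=0, w=1) across all three frames, and the round trip restores the original grid.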

The authors also introduce CrossFPS, a multi-game FPS dataset with frame-aligned action telemetry, comprising 69K clips from 7 titles with 10-degree-of-freedom controller signals. The dataset is curated to remove gameplay bias, so the model learns general visual-to-action mappings rather than game-specific patterns.

The results show that SCOPE enables strong action responsiveness, precise scope separation, and effective cross-game generalization. Because the model learns general visual-to-action mappings, it transfers zero-shot to unseen scenes, which suggests it can be applied to new games without additional training data.

📅 Published on May 22

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23345
• PDF: https://arxiv.org/pdf/2605.23345
• Project Page: https://z2tong.github.io/SCOPE/

🤖 Models citing this paper:
• https://huggingface.co/zizhaotong/SCOPE

📊 Datasets citing this paper:
• https://huggingface.co/datasets/zizhaotong/CrossFPS-train
• https://huggingface.co/datasets/zizhaotong/CrossFPS-val

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#FirstPersonShooterGames #CrossGameOperations #PlayableEnvironments #VideoDiffusionModels #TransformerBlocks

🔥 minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

💡 The paper presents a comprehensive framework called minWM for converting bidirectional video diffusion models into real-time interactive video world models. The problem addressed is that recent video diffusion foundation models have achieved high-quality video generation but turning them into real-time interactive world models remains challenging due to the need for controllable, causal, and low-latency capabilities.

The method used in minWM is a full-stack open-source framework that provides an end-to-end pipeline to convert existing bidirectional video foundation models into camera-controllable few-step autoregressive world models. This is achieved through fine-tuning and distillation techniques, including causal forcing, causal consistency distillation, and asymmetric DMD. The framework is modular and architecture-extensible, allowing it to be instantiated on different open backbones and adapted to new data distributions, training recipes, and latency targets.

The result is a camera-controllable, real-time interactive video world model with low-latency rollout and high-quality video generation. The framework is released with runnable scripts, checkpoints, documentation, and inference code, along with practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. Overall, minWM provides a reproducible and extensible recipe for building and adapting real-time interactive video world models.

📅 Published on May 28

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30263
• PDF: https://arxiv.org/pdf/2605.30263

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VideoDiffusionModels #RealTimeInteractiveSystems #VideoWorldModels #BidirectionalVideoGeneration #InteractiveVideoFrameworks

🔥 Training Video Foundation Models with NVIDIA NeMo

💡 The paper addresses the challenges of training large-scale video foundation models that generate high-quality videos. Such models have been used to simulate the real world and to build creative visual experiences, but training them is difficult because of the complexity and size of video datasets.

To overcome this, the authors present a scalable, open-source pipeline built on NVIDIA NeMo for training and inference of video foundation models. The pipeline provides accelerated video dataset curation, multimodal data loading, and parallelized video diffusion model training and inference.

The authors also provide a comprehensive performance analysis that highlights best practices for efficient video foundation model training and inference. Overall, the paper contributes a scalable and efficient pipeline for training and running video foundation models.

📅 Published on Mar 17, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2503.12964
• PDF: https://arxiv.org/pdf/2503.12964

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VideoFoundationModels #NVIDIANeMo #VideoDatasetCuration #MultimodalLearning #VideoDiffusionModels

🔥 OPSD-V: On-Policy Self-Distillation for Post-Training Few-Step Autoregressive Video Generators

💡 The paper proposes a method called On-Policy Self-Distillation for Post-Training Few-Step Autoregressive Video Generators, or OPSD-V, which aims to improve the quality of videos generated by few-step autoregressive video diffusion models. The problem with existing models is that they can produce long videos with low latency, but the quality of the video degrades over time due to error accumulation and weakened motion dynamics.

To address this issue, OPSD-V introduces real long-video data as temporal context during training, providing dense trajectory-level supervision to improve visual quality and motion dynamics. The method works by having a student model follow the exact inference-time rollout, generating each chunk of the video conditioned on its own previously generated cache. In parallel, a teacher model is evaluated at the same denoising states, but uses a cleaner temporal cache that can be replaced by real-video context. This provides corrective targets under on-policy cache dynamics, without changing the inference mechanism.
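The student and teacher roles can be sketched with a toy scalar model. Everything here is a simplifying assumption: the `denoise` function, the scalar "chunks", and the squared-error loss are hypothetical stand-ins, and the same weights serve both roles because the method is described as self-distillation.

```python
def denoise(noisy_state, cache, weight=0.6):
    """Hypothetical toy denoiser: blend the noisy state with the cached context."""
    context = sum(cache) / len(cache) if cache else 0.0
    return weight * noisy_state + (1.0 - weight) * context


def on_policy_self_distillation_losses(noisy_states, real_chunks):
    """Per-chunk loss between the student and a teacher with cleaner context.

    The student follows the inference-time rollout and conditions on its own
    outputs. The teacher sees the same noisy state but uses real-video chunks
    as its cache. In real training the teacher output would be a fixed target
    (no gradient); this sketch only computes the loss values.
    """
    student_cache = []
    losses = []
    for i, noisy in enumerate(noisy_states):
        student_pred = denoise(noisy, student_cache)
        teacher_pred = denoise(noisy, real_chunks[:i])
        losses.append((student_pred - teacher_pred) ** 2)
        student_cache.append(student_pred)  # on-policy: reuse own output
    return losses


real_chunks = [1.0, 1.0, 1.0, 1.0]
noisy_states = [0.5, 0.8, 1.2, 0.9]
losses = on_policy_self_distillation_losses(noisy_states, real_chunks)
print(losses)
```

The first chunk has no context, so student and teacher agree and the loss is zero. The gap appears only once the student's own cache starts to differ from the real-video cache, which is where the corrective signal comes from.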

The results show that OPSD-V consistently improves the visual quality, motion dynamics, and VBenchLong scores of the generated videos. The method is applied to representative few-step autoregressive video models, including Self-Forcing and LongLive, and the experiments demonstrate significant improvements. A user study with 10 participants also shows that OPSD-V is preferred over the base models in 66 percent of overall-preference judgments, and 82.5 percent excluding ties. Overall, the paper contributes a novel method for improving the quality of videos generated by few-step autoregressive video diffusion models, without altering the inference mechanism.
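The two user-study percentages can be related with a short calculation. It assumes that 66 percent is the share of all judgments, ties included, and that 82.5 percent is wins divided by wins plus losses. The summary does not state these definitions, so the derived tie and loss shares are an inference rather than reported numbers.

```python
win_share_all = 0.66       # OPSD-V preferred, as a share of all judgments (assumed)
win_share_no_ties = 0.825  # OPSD-V preferred, as a share of non-tie judgments (assumed)

# wins / all = (wins / non_ties) * (non_ties / all), so solve for non_ties / all.
non_tie_share = win_share_all / win_share_no_ties
tie_share = 1.0 - non_tie_share
loss_share = non_tie_share - win_share_all

print(f"non-tie share: {non_tie_share:.3f}")
print(f"tie share: {tie_share:.3f}")
print(f"base-model-preferred share: {loss_share:.3f}")
```

Under these assumptions, roughly 20 percent of judgments were ties and about 14 percent favored the base models.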

📅 Published on Jul 9

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2607.08766
• PDF: https://arxiv.org/pdf/2607.08766
• Project Page: https://meigen-ai.github.io/OPSD-V/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AutoregressiveVideoGeneration #VideoDiffusionModels #SelfDistillationTechniques #FewStepVideoGeneration #PostTrainingOptimization
