✨Learning to Refocus with Video Diffusion Models
📝 Summary:
A novel method enables realistic post-capture refocusing from a single defocused image. It uses video diffusion models to generate a focal stack, which the user can then browse to adjust focus interactively. The approach outperforms existing methods at focus editing for photography.
🔹 Publication Date: Dec 22
🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.19823
• PDF: https://arxiv.org/pdf/2512.19823
• Project Page: https://learn2refocus.github.io/
• Github: https://github.com/tedlasai/learn2refocus
🔹 Models citing this paper:
• https://huggingface.co/tedlasai/learn2refocus
==================================
For more data science resources:
✓ https://xn--r1a.website/DataScienceT
#VideoDiffusionModels #ComputationalPhotography #ImageRefocusing #DeepLearning #ComputerVision
🔥 UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
📅 Published on May 1
🔗 Links:
• arXiv: https://arxiv.org/abs/2605.00658
• PDF: https://arxiv.org/pdf/2605.00658
• Project Page: https://houyuanchen111.github.io/UniVidX.github.io/
• GitHub: https://github.com/houyuanchen111/UniVidX ⭐ 93
🤖 Models citing this paper:
• https://huggingface.co/houyuanchen/UniVidX
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#MultimodalVideoGeneration #VideoDiffusionModels #ConditionalGeneration #CrossModalLearning #MultimodalFusionArchitectures
💡 The paper introduces UniVidX, a unified multimodal framework for versatile video generation using video diffusion model priors. The problem with existing methods is that they train separate models for each task, limiting the modeling of correlations across different modalities. UniVidX addresses this issue by formulating pixel-aligned tasks as conditional generation in a shared multimodal space, allowing it to adapt to modality-specific distributions while preserving the native priors of the video diffusion model.
The framework consists of three key designs: Stochastic Condition Masking, Decoupled Gated LoRA, and Cross-Modal Self-Attention. Stochastic Condition Masking enables omni-directional conditional generation by randomly partitioning modalities into clean conditions and noisy targets during training. Decoupled Gated LoRA preserves the strong priors of the video diffusion model by introducing per-modality LoRAs that are activated when a modality serves as the generation target. Cross-Modal Self-Attention facilitates information exchange and inter-modal alignment by sharing keys and values across modalities while keeping modality-specific queries.
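The random partition behind Stochastic Condition Masking can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name, the 0.5 sampling probability, the rule that at least one modality stays a target, and the modality names are all assumptions made for the example.

```python
import random

def stochastic_condition_mask(modalities, rng):
    """Randomly split modalities into clean conditions and noisy targets.

    Hypothetical helper: the sampling rule here is an assumption, not taken from the paper.
    """
    # Each modality independently becomes a generation target with probability 0.5.
    targets = [m for m in modalities if rng.random() < 0.5]
    # Keep at least one target so there is always something to generate.
    if not targets:
        targets = [rng.choice(modalities)]
    # Everything that is not a target is fed in as a clean condition.
    conditions = [m for m in modalities if m not in targets]
    return conditions, targets

rng = random.Random(0)
# Illustrative modality names only.
modalities = ["rgb", "albedo", "irradiance", "normal"]
for _ in range(3):
    conditions, targets = stochastic_condition_mask(modalities, rng)
    print("conditions:", conditions, "| targets:", targets)
```

Because any subset can end up on the condition side, a single model trained this way sees every direction of the task (for example RGB to intrinsic maps, or intrinsic maps to RGB) during training.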
The authors instantiate UniVidX in two domains: UniVid-Intrinsic for RGB videos and intrinsic maps, and UniVid-Alpha for blended RGB videos and their constituent RGBA layers. The results show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1000 videos. Overall, UniVidX provides a unified framework for versatile video generation, allowing for more efficient and effective modeling of correlations across different modalities.
🔥 AnyFlow: Any-Step Video Diffusion Model with On-Policy Flow Map Distillation
📅 Published on May 13
🔗 Links:
• arXiv: https://arxiv.org/abs/2605.13724
• PDF: https://arxiv.org/pdf/2605.13724
• Project Page: https://nvlabs.github.io/AnyFlow/
• GitHub: https://github.com/NVlabs/AnyFlow ⭐ 197
🤖 Models citing this paper:
• https://huggingface.co/nvidia/AnyFlow-Wan2.1-T2V-14B-Diffusers
• https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-1.3B-Diffusers
• https://huggingface.co/nvidia/AnyFlow-FAR-Wan2.1-14B-Diffusers
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VideoDiffusionModels #OnPolicyLearning #FlowMapDistillation #AnyStepSampling #DiffusionBasedGenerativeModels
💡 The paper introduces AnyFlow, a novel framework for any-step video diffusion distillation that improves upon existing consistency distillation methods. The problem with consistency distillation is that its performance degrades as more sampling steps are used at test time, limiting its effectiveness for any-step video diffusion. This is because consistency distillation replaces the original probability-flow ODE trajectory with a consistency-sampling trajectory, which weakens the desirable test-time scaling behavior of ODE sampling.
To address this limitation, AnyFlow optimizes the full ODE sampling trajectory instead of distilling a model for only a few fixed sampling steps. The method involves shifting the distillation target from endpoint consistency mapping to flow-map transition learning over arbitrary time intervals. Additionally, the authors propose Flow Map Backward Simulation, which decomposes a full Euler rollout into shortcut flow-map transitions, enabling efficient on-policy distillation that reduces test-time errors.
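The idea of a flow map over arbitrary time intervals can be illustrated on a toy ODE. The sketch below assumes a one-dimensional velocity field dx/dt = -x whose flow map is known in closed form. In the paper the flow map would be a learned network, so everything here (function names, the toy ODE, the time grid) is a stand-in for illustration only.

```python
import math

def velocity(x, t):
    # Toy velocity field dx/dt = -x; a stand-in for a learned model.
    return -x

def euler_rollout(x, t0, t1, n_steps):
    # Plain Euler integration of the ODE from t0 to t1.
    h = (t1 - t0) / n_steps
    t = t0
    for _ in range(n_steps):
        x = x + h * velocity(x, t)
        t += h
    return x

def flow_map(x, t, s):
    # Exact flow map of the toy ODE: the state at time s given state x at time t.
    return x * math.exp(-(s - t))

def multi_step(x, times):
    # Chain flow-map transitions over consecutive, arbitrarily spaced intervals.
    for t, s in zip(times[:-1], times[1:]):
        x = flow_map(x, t, s)
    return x

x0 = 2.0
one_jump = flow_map(x0, 0.0, 1.0)
four_jumps = multi_step(x0, [0.0, 0.1, 0.5, 0.75, 1.0])
euler_coarse = euler_rollout(x0, 0.0, 1.0, 4)
euler_fine = euler_rollout(x0, 0.0, 1.0, 1000)
print(one_jump, four_jumps, euler_coarse, euler_fine)
```

With an exact flow map, one large jump and a chain of smaller jumps land on the same point, while the Euler rollout only approaches that point as the number of steps grows. A model trained to approximate such transitions can be sampled with any step budget.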
The results show that AnyFlow matches or surpasses consistency-based counterparts in the few-step regime while continuing to scale with the sampling-step budget. Experiments cover both bidirectional and causal architectures, at scales ranging from 1.3B to 14B parameters.
🔥 RAVEN: Real-time Autoregressive Video Extrapolation with Consistency-model GRPO
📅 Published on May 14
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15190
• PDF: https://arxiv.org/pdf/2605.15190
• Project Page: https://yanzuo.lu/raven/
🤖 Models citing this paper:
• https://huggingface.co/mvp-lab/RAVEN
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#AutoregressiveVideoExtrapolation #VideoDiffusionModels #ReinforcementLearningForVideo #ConsistencyModelBasedRL #RealTimeVideoGeneration
💡 The paper introduces RAVEN, a real-time autoregressive video extrapolation network, and CM-GRPO, a consistency model-based reinforcement learning approach. The problem addressed is the gap between the history distributions encountered during training and those arising at inference in causal autoregressive video diffusion models, which constrains generation quality over long horizons.
To solve this problem, RAVEN repacks each self rollout into an interleaved sequence of clean historical endpoints and noisy denoising states, aligning training attention with inference-time extrapolation. This formulation allows downstream chunk losses to supervise the history representations on which future predictions depend.
Additionally, CM-GRPO reformulates a consistency sampling step as a conditional Gaussian transition and applies online reinforcement learning directly to this kernel, avoiding the Euler-Maruyama auxiliary process adopted in prior flow-model RL formulations.
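To see why a consistency sampling step can be treated as a Gaussian transition, consider the common multistep scheme: predict a clean sample, then re-noise it to the next time. The sketch below assumes a linear interpolation schedule x_t = (1 - t) * x0 + t * eps and uses a hypothetical toy `consistency_fn`. Neither is taken from the paper, whose schedule and parameterization may differ.

```python
import math
import random

def consistency_fn(x_t, t):
    # Hypothetical stand-in for a consistency model: maps a noisy state to a clean estimate.
    return [v * (1.0 - t) for v in x_t]

def transition_params(x_t, t, t_next):
    # Predict the clean endpoint, then re-noise to t_next under the assumed schedule
    # x_t = (1 - t) * x0 + t * eps. The next state is Gaussian with this mean and std.
    x0_hat = consistency_fn(x_t, t)
    mean = [(1.0 - t_next) * v for v in x0_hat]
    std = t_next
    return mean, std

def sample_next(x_t, t, t_next, rng):
    # Draw the next state from the Gaussian transition.
    mean, std = transition_params(x_t, t, t_next)
    return [m + std * rng.gauss(0.0, 1.0) for m in mean]

def log_prob(x_next, x_t, t, t_next):
    # Log-density of an isotropic Gaussian; usable in a policy-gradient likelihood ratio.
    mean, std = transition_params(x_t, t, t_next)
    d = len(x_next)
    sq = sum((a - m) ** 2 for a, m in zip(x_next, mean))
    return -0.5 * sq / std ** 2 - d * math.log(std) - 0.5 * d * math.log(2.0 * math.pi)

rng = random.Random(0)
x_t = [1.0, -2.0]
x_next = sample_next(x_t, 0.8, 0.5, rng)
print(x_next, log_prob(x_next, x_t, 0.8, 0.5))
```

Since the transition has a closed-form log-density, an online RL objective can score sampled steps directly, without introducing a separate stochastic process just to obtain probabilities.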
The results demonstrate that RAVEN surpasses recent causal video distillation baselines across quality, semantic, and dynamic degree evaluations. Furthermore, CM-GRPO provides further gains when combined with RAVEN, indicating the effectiveness of the proposed methods in improving real-time video generation.
Overall, the paper approaches real-time video generation through causal autoregressive extrapolation, pairing training that is aligned with inference-time histories with consistency-model-based reinforcement learning.
🔥 SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models
📅 Published on May 22
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23345
• PDF: https://arxiv.org/pdf/2605.23345
• Project Page: https://z2tong.github.io/SCOPE/
🤖 Models citing this paper:
• https://huggingface.co/zizhaotong/SCOPE
📊 Datasets citing this paper:
• https://huggingface.co/datasets/zizhaotong/CrossFPS-train
• https://huggingface.co/datasets/zizhaotong/CrossFPS-val
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#FirstPersonShooterGames #CrossGameOperations #PlayableEnvironments #VideoDiffusionModels #TransformerBlocks
💡 The paper introduces SCOPE, a method for simulating cross-game operations in playable environments for first-person shooter (FPS) games. Existing interactive world models for FPS games struggle to handle high-frequency, overlapping control signals without disrupting unaffected regions. This is because they inject actions globally and are trained on single game titles, which fails under dense FPS inputs.
The proposed method conditions the transformer blocks of a video diffusion model to separate in-scope from out-of-scope visual effects without requiring segmentation labels. A conditioning module inserted into each transformer block of a pre-trained video diffusion model reshapes features into per-pixel temporal sequences, so that each position computes its action response from local visual content.
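The reshaping into per-pixel temporal sequences can be pictured as a simple transpose of the feature grid. This is a minimal sketch using nested lists, under the assumption that features are laid out as [time][height][width][channels]. The function name and layout are illustrative and not taken from the paper.

```python
def to_per_pixel_sequences(features):
    """Turn a [T][H][W] grid of feature vectors into H*W sequences of length T.

    Hypothetical helper; a real model would do this with a tensor reshape.
    """
    T = len(features)
    H = len(features[0])
    W = len(features[0][0])
    # One sequence per spatial position, ordered row by row, each running over time.
    return [
        [features[t][y][x] for t in range(T)]
        for y in range(H)
        for x in range(W)
    ]

# Two frames, one row, two pixels, one channel per pixel.
video = [
    [[[0.0], [0.1]]],  # frame 0
    [[[1.0], [1.1]]],  # frame 1
]
sequences = to_per_pixel_sequences(video)
print(len(sequences), "sequences of length", len(sequences[0]))
```

Each resulting sequence follows one spatial position through time, so an action signal applied along that sequence can respond to what is visible at that position rather than being injected globally.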
The authors also introduce CrossFPS, a multi-game FPS dataset with frame-aligned action telemetry, comprising 69K clips from 7 titles with 10-degree-of-freedom controller signals. The dataset is curated to remove gameplay bias, so that the model learns general visual-to-action mappings rather than game-specific patterns.
The results show that SCOPE achieves strong action responsiveness, precise scope separation, and effective cross-game generalization. Because the model learns general visual-to-action mappings, it transfers zero-shot to unseen scenes without additional training data.
🔥 minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
📅 Published on May 28
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30263
• PDF: https://arxiv.org/pdf/2605.30263
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VideoDiffusionModels #RealTimeInteractiveSystems #VideoWorldModels #BidirectionalVideoGeneration #InteractiveVideoFrameworks
💡 The paper presents a comprehensive framework called minWM for converting bidirectional video diffusion models into real-time interactive video world models. The problem addressed is that recent video diffusion foundation models have achieved high-quality video generation but turning them into real-time interactive world models remains challenging due to the need for controllable, causal, and low-latency capabilities.
minWM provides an end-to-end pipeline that converts existing bidirectional video foundation models into camera-controllable, few-step autoregressive world models. The conversion relies on fine-tuning and distillation techniques, including causal forcing, causal consistency distillation, and asymmetric DMD. The framework is modular and architecture-extensible, so it can be instantiated on different open backbones and adapted to new data distributions, training recipes, and latency targets.
The result is a camera-controllable, real-time interactive video world model with low-latency rollout and high-quality video generation. The framework is released with runnable scripts, checkpoints, documentation, and inference code, along with practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. Overall, minWM offers a reproducible and extensible recipe for building and adapting real-time interactive video world models.