🔥 UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors
📅 Published on May 1
🔗 Links:
• arXiv: https://arxiv.org/abs/2605.00658
• PDF: https://arxiv.org/pdf/2605.00658
• Project Page: https://houyuanchen111.github.io/UniVidX.github.io/
• GitHub: https://github.com/houyuanchen111/UniVidX ⭐ 93
🤖 Models citing this paper:
• https://huggingface.co/houyuanchen/UniVidX
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#MultimodalVideoGeneration #VideoDiffusionModels #ConditionalGeneration #CrossModalLearning #MultimodalFusionArchitectures
💡 The paper introduces UniVidX, a unified multimodal framework for versatile video generation built on video diffusion model (VDM) priors. Existing methods train a separate model for each task, which limits how well they can model correlations across modalities. UniVidX instead formulates pixel-aligned tasks as conditional generation in a shared multimodal space, so it can adapt to modality-specific distributions while preserving the VDM's native priors.
The framework consists of three key designs: Stochastic Condition Masking, Decoupled Gated LoRA, and Cross-Modal Self-Attention. Stochastic Condition Masking enables omni-directional conditional generation by randomly partitioning modalities into clean conditions and noisy targets during training. Decoupled Gated LoRA preserves the strong priors of the video diffusion model by introducing per-modality LoRAs that are activated when a modality serves as the generation target. Cross-Modal Self-Attention facilitates information exchange and inter-modal alignment by sharing keys and values across modalities while keeping modality-specific queries.
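The three designs above can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the modality names, the 50% masking probability, the rule that at least one modality must be a target, the gate being exactly 0.0 for non-targets, and all function names are assumptions made for illustration. Real systems would operate on latent video tensors with learned projections rather than Python lists.

```python
import math
import random

# Hypothetical modality names, chosen only for illustration.
MODALITIES = ["rgb", "albedo", "normal"]


def partition_modalities(modalities, rng, p_target=0.5):
    """Stochastic condition masking: randomly split modalities into
    clean conditions and noisy targets.
    Assumption: resample until at least one modality is a target."""
    while True:
        is_target = {m: rng.random() < p_target for m in modalities}
        if any(is_target.values()):
            break
    conditions = [m for m in modalities if not is_target[m]]
    targets = [m for m in modalities if is_target[m]]
    return conditions, targets


def lora_gates(modalities, targets):
    """Gated LoRA: a modality's LoRA is switched on when that modality
    is a generation target (assumed fully off otherwise)."""
    return {m: 1.0 if m in targets else 0.0 for m in modalities}


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Each modality keeps its own queries but attends over the keys and
    values of all modalities, concatenated along the token axis.
    Inputs map modality name -> list of token vectors (lists of floats)."""
    shared_k = [k for m in keys for k in keys[m]]
    shared_v = [v for m in keys for v in values[m]]  # same order as shared_k
    scale = math.sqrt(len(shared_k[0]))
    out = {}
    for m, qs in queries.items():
        out[m] = []
        for q in qs:
            scores = [sum(a * b for a, b in zip(q, k)) / scale for k in shared_k]
            weights = softmax(scores)
            out[m].append([
                sum(w * v[j] for w, v in zip(weights, shared_v))
                for j in range(len(shared_v[0]))
            ])
    return out


# One simulated training step: choose roles, then derive the LoRA gates.
rng = random.Random(0)
conditions, targets = partition_modalities(MODALITIES, rng)
gates = lora_gates(MODALITIES, targets)
print("conditions:", conditions, "targets:", targets, "gates:", gates)
```

Because the roles are resampled at every step, a single model sees every modality both as a condition and as a target, which is what makes generation in any direction possible at inference time.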
The authors instantiate UniVidX in two domains: UniVid-Intrinsic for RGB videos and intrinsic maps, and UniVid-Alpha for blended RGB videos and their constituent RGBA layers. Both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even though they are trained on fewer than 1,000 videos. Overall, UniVidX offers a single framework for modeling cross-modal correlations in video generation.
🔥 LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
📅 Published on May 28
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30265
• PDF: https://arxiv.org/pdf/2605.30265
• Project Page: https://maplebb.github.io/LoMo/page/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VisionLanguageModels #ModalitySubstitution #CrossModalLearning #MultimodalFusion #DeepLearningArchitectures
💡 The paper addresses modality sensitivity in vision-language models: performance degrades significantly when the modality of the input changes, for example when a textual question is replaced with its rendered-image counterpart. The authors attribute this to a bias in current training corpora, where text and images are typically organized into distinct and asymmetric roles.
To address this, they propose Local Modality Substitution (LoMo), a data curation approach that provides supervision for cross-modal representational invariance between semantically equivalent text and image carriers. The method reformulates single-modality prompts into seamlessly interleaved multimodal sequences by dynamically selecting target text spans and recasting them as rendered images, which preserves the same semantics across carriers.
Evaluated on 13 diverse multimodal benchmarks, the approach significantly improves overall multimodal reasoning and yields deeper cross-modal fusion, with consistent gains across foundation models: 2.67 points on one model and 2.82 points on another, compared to standard methods. The method is lightweight and architecture-agnostic.
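The span-substitution step can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's pipeline: the span is picked uniformly at random over whitespace-separated words (the paper's dynamic selection policy is not described in this summary), and the "image" segment is only a placeholder that records which text would be rendered into pixels. The names `Segment`, `substitute_span`, and `recover_text` are hypothetical.

```python
import random
from dataclasses import dataclass


@dataclass
class Segment:
    kind: str  # "text" or "image"
    text: str  # for "image": the text that would be rendered into pixels


def substitute_span(prompt, rng, max_span_words=4):
    """Turn a text-only prompt into an interleaved text/image sequence by
    replacing one randomly chosen word span with an image placeholder.
    Assumption: uniform random span selection over whitespace tokens."""
    words = prompt.split()
    if not words:
        return []
    span_len = rng.randint(1, min(max_span_words, len(words)))
    start = rng.randint(0, len(words) - span_len)
    end = start + span_len
    pieces = [
        Segment("text", " ".join(words[:start])),
        Segment("image", " ".join(words[start:end])),
        Segment("text", " ".join(words[end:])),
    ]
    # Drop empty text segments when the span touches either end.
    return [seg for seg in pieces if seg.text]


def recover_text(segments):
    """Read the sequence back as plain text, regardless of carrier."""
    return " ".join(seg.text for seg in segments)


rng = random.Random(0)
prompt = "What is the capital of France ?"
for seg in substitute_span(prompt, rng):
    print(seg.kind, "|", seg.text)
```

The point of `recover_text` is that the interleaved sequence carries the same words as the original prompt, so the training target can stay unchanged while the carrier of part of the input switches from text to image.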