AI & ML Papers
🔥 MMSkills: Towards Multimodal Skills for General Visual Agents
📅 Published on May 14
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.13527
• PDF: https://arxiv.org/pdf/2605.13527
• Project Page: https://deepexperience.github.io/MMSkills/
📊 Datasets citing this paper:
• https://huggingface.co/datasets/zhangkangning/mmskills
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#MultimodalProceduralKnowledge #VisualDecisionMaking #MultimodalSkills #GeneralVisualAgents #ProceduralKnowledgeRepresentation
💡 The paper introduces MMSkills, a framework for representing and using reusable multimodal procedures for visual decision making in complex environments. The authors argue that current skill packages for visual agents are limited because they primarily rely on textual prompts or executable code, and do not account for the multimodal nature of procedural knowledge. To address this, the authors formalize the concept of multimodal procedural knowledge, which requires recognizing relevant state, interpreting visual evidence, and deciding what to do next.
The authors identify three practical challenges in developing multimodal skill packages: defining the contents of a package, deriving packages from public interaction experience, and consulting multimodal evidence at inference time. To overcome these challenges, the authors propose a framework that represents each skill as a compact package containing a textual procedure, runtime state cards, and multi-view keyframes.
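The package layout described above can be pictured as a small data structure. This is a minimal sketch, not the authors' schema: the class names, field names, the `view` labels, and the example skill are all hypothetical, chosen only to show how a textual procedure, state cards, and multi-view keyframes might hang together.

```python
from dataclasses import dataclass, field


@dataclass
class Keyframe:
    """Hypothetical reference screenshot; `view` separates multi-view variants."""
    keyframe_id: str
    view: str          # illustrative labels such as "full" or "crop"
    image_path: str


@dataclass
class StateCard:
    """Hypothetical runtime state card: which state to recognize and what shows it."""
    name: str
    description: str
    keyframe_ids: list[str]


@dataclass
class SkillPackage:
    """One skill: textual procedure + state cards + multi-view keyframes."""
    name: str
    procedure: list[str]
    state_cards: list[StateCard] = field(default_factory=list)
    keyframes: dict[str, Keyframe] = field(default_factory=dict)

    def keyframes_for(self, card_name: str) -> list[Keyframe]:
        """Return only the keyframes referenced by one state card."""
        for card in self.state_cards:
            if card.name == card_name:
                return [self.keyframes[k] for k in card.keyframe_ids if k in self.keyframes]
        return []


# Made-up example skill, for illustration only.
skill = SkillPackage(
    name="export_as_pdf",
    procedure=["Open the File menu", "Choose Export", "Select PDF and confirm"],
    state_cards=[StateCard("export_dialog", "Export dialog is open", ["kf1", "kf2"])],
    keyframes={
        "kf1": Keyframe("kf1", "full", "frames/export_full.png"),
        "kf2": Keyframe("kf2", "crop", "frames/export_crop.png"),
    },
)
selected = skill.keyframes_for("export_dialog")
```

Looking up keyframes per state card, rather than loading every image in the package, mirrors the summary's point about avoiding excessive image context.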
The authors develop an agentic trajectory-to-skill generator that transforms public non-evaluation trajectories into reusable multimodal skills through workflow grouping, procedure induction, visual grounding, and meta-skill-guided auditing. This generator enables the construction of multimodal skill packages from public interaction experience.
To use these packages, the authors introduce a branch-loaded multimodal skill agent that inspects selected state cards and keyframes in a temporary branch, aligns them with the live environment, and distills them into structured guidance for the main agent. This approach allows the agent to consult multimodal evidence at inference time without excessive image context or over-anchoring to reference screenshots.
The authors evaluate MMSkills on GUI and game-based visual-agent benchmarks and demonstrate that it consistently improves the performance of both frontier and smaller multimodal agents. The results suggest that external multimodal procedural knowledge complements model-internal priors, and that MMSkills provides an effective framework for representing and using reusable multimodal procedures for visual decision making. Overall, the paper contributes a new framework for multimodal skills, a method for generating these skills from public interaction experience, and an approach for using these skills in visual decision making.
🔥 FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization
📅 Published on May 15
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15824
• PDF: https://arxiv.org/pdf/2605.15824
• Project Page: https://quanjiansong.github.io/projects/FashionChameleon/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#RealTimeVideoCustomization #HumanGarmentInteraction #AutoregressiveVideoGeneration #InteractiveGarmentControl #EcommerceVideoTechnology
💡 The paper introduces FashionChameleon, a real-time and interactive framework for human-garment video customization in autoregressive video generation. The problem addressed is the inability of existing approaches to support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation.
To solve this problem, the authors propose a method that consists of three key techniques. First, they train a Teacher Model with In-Context Learning on a single reference-garment pair, which encourages the model to implicitly preserve coherence during single-garment switching. Second, they introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. Third, they propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdraw, and reference KV disentangle to achieve garment switching while preserving motion coherence.
The results show that FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation, while achieving real-time generation at 23.8 FPS on a single GPU. This is 30-180 times faster than existing baselines. The framework enables users to interactively switch garments during generation, making it a significant contribution to the field of human-centric video customization. Overall, the paper presents a novel approach to achieving real-time and interactive human-garment video customization, which has significant commercial value and potential applications in e-commerce and content creation.
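To put the reported numbers in perspective, the arithmetic below derives the per-frame latency and the baseline speeds implied by the 30-180x range. It assumes the speedup is a ratio of frames per second under comparable conditions, which the summary does not state explicitly.

```python
fps = 23.8                              # reported generation speed
speedup_low, speedup_high = 30, 180     # reported speedup range over baselines

frame_latency_ms = 1000 / fps           # time budget per generated frame
baseline_fps_fast = fps / speedup_low   # fastest baseline implied by the range
baseline_fps_slow = fps / speedup_high  # slowest baseline implied by the range

print(f"per-frame latency: {frame_latency_ms:.1f} ms")
print(f"implied baseline speed: {baseline_fps_slow:.2f} to {baseline_fps_fast:.2f} FPS")
```

Under that assumption, the baselines would need roughly 1.3 to 7.6 seconds per frame, which is why they cannot support interactive garment switching.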
🔥 ReactiveGWM: Steering NPC in Reactive Game World Models
📅 Published on May 14
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15256
• PDF: https://arxiv.org/pdf/2605.15256
• Project Page: https://inv-wzq.github.io/ReactiveGWM/
🤖 Models citing this paper:
• https://huggingface.co/INV-WZQ/ReactiveGWM-Models
📊 Datasets citing this paper:
• https://huggingface.co/datasets/INV-WZQ/ReactiveGWM-Datasets
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#GameWorldModels #ReactiveGameDevelopment #NPCAI #GamePhysicsSimulation #ReactiveGameWorldModeling
💡 Current game world models simulate environments from a player-centric perspective and treat non-player characters (NPCs) as background elements, so they fail to capture interactions between the player and the NPC. The resulting models lack physical understanding and cannot simulate action-induced NPC reactions.
The paper introduces ReactiveGWM, a reactive game world model that synthesizes dynamic interactions between the player and the NPC by decoupling player controls from NPC behaviors. It uses diffusion models with cross-attention modules that learn a game-agnostic representation of interactive logic, allowing zero-shot strategy transfer across different games.
In the proposed method, player actions are injected into the diffusion backbone via a lightweight additive bias, while high-level NPC responses are grounded through cross-attention modules. Because the learned interaction logic is game-agnostic, it can be transferred to other games without domain-specific retraining.
The results show that ReactiveGWM maintains fine-grained player controllability while achieving robust, prompt-aligned NPC strategy adherence. The model is evaluated on two Street Fighter games, demonstrating its ability to unlock steerable NPC interactions without requiring domain-specific retraining. Overall, the paper contributes a novel approach to simulating dynamic interactions between players and NPCs in game worlds, paving the way for scalable and strategy-rich interactions with NPCs.
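The two conditioning paths described above (an additive bias for player actions, cross attention for NPC behavior) can be illustrated with a toy, dependency-free sketch. Everything here is an assumption for illustration: the function names, the two-dimensional vectors, the action table, and the choice of prompt-token embeddings as keys and values are not taken from the paper.

```python
import math


def add_action_bias(hidden, action_id, action_table):
    """Inject the player action by adding one learned vector to every token."""
    bias = action_table[action_id]
    return [[h + b for h, b in zip(token, bias)] for token in hidden]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head cross attention: visual tokens attend to conditioning tokens."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


# Toy inputs (hypothetical): two visual tokens, two actions, two prompt tokens.
hidden = [[0.1, 0.2], [0.3, 0.4]]
action_table = {"punch": [1.0, 0.0], "block": [0.0, 1.0]}
prompt_tokens = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for an NPC strategy prompt

h = add_action_bias(hidden, "punch", action_table)
npc_context = cross_attention(h, prompt_tokens, prompt_tokens)
# Residual connection: the NPC context is added on top of the action-biased tokens.
output = [[a + b for a, b in zip(tok, ctx)] for tok, ctx in zip(h, npc_context)]
```

Keeping the action path as a plain addition and the NPC path as attention over separate tokens is one way to read "decoupling": either input can change without touching the other.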
🔥 DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo
📅 Published on May 15
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.16257
• PDF: https://arxiv.org/pdf/2605.16257
• Project Page: https://dexjoco.github.io/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#DexterousManipulation #TaskOrientedRobotics #MuJoCoBenchmark #RoboticHandControl #BimanualCoordination
💡 The paper presents DexJoCo, a benchmark and toolkit for task-oriented dexterous manipulation, which aims to advance the capabilities of robotic hands in complex object interactions. The problem addressed is the lack of standardized benchmarks for evaluating dexterous manipulation, with existing benchmarks lacking tasks that reflect the unique capabilities of dexterous hands. To address this, the authors developed DexJoCo, which comprises 11 functional tasks that evaluate tool-use, bimanual coordination, long-horizon execution, and reasoning.
The method used to achieve this involves developing a low-cost data collection system, which collected 1.1K trajectories across these tasks, with support for domain randomization to assess robustness. The authors also benchmarked modern models under diverse settings, including visual and dynamics randomization, multi-task training, and action-head adaptation.
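Visual and dynamics randomization of the kind mentioned above usually amounts to sampling scene parameters per episode. The sketch below shows the idea only; the parameter names and ranges are hypothetical and do not come from DexJoCo, and no MuJoCo API is used.

```python
import random
from dataclasses import dataclass

# Hypothetical randomization ranges; none of these values come from the paper.
VISUAL_RANGES = {"light_intensity": (0.5, 1.5), "camera_jitter_deg": (-5.0, 5.0)}
DYNAMICS_RANGES = {"friction": (0.6, 1.2), "object_mass_kg": (0.05, 0.5)}


@dataclass(frozen=True)
class EpisodeConfig:
    light_intensity: float
    camera_jitter_deg: float
    friction: float
    object_mass_kg: float


def sample_config(rng, visual=True, dynamics=True):
    """Sample one episode config; disabled groups fall back to range midpoints."""
    params = {}
    for enabled, ranges in ((visual, VISUAL_RANGES), (dynamics, DYNAMICS_RANGES)):
        for name, (low, high) in ranges.items():
            params[name] = rng.uniform(low, high) if enabled else (low + high) / 2
    return EpisodeConfig(**params)


rng = random.Random(0)                      # fixed seed for repeatable evaluation
configs = [sample_config(rng) for _ in range(3)]
nominal = sample_config(rng, visual=False, dynamics=False)
```

Toggling `visual` and `dynamics` separately makes it possible to measure how much each kind of shift hurts a policy compared with the nominal scene.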
The paper identifies several important insights and common limitations of current policies in dexterous manipulation, highlighting key challenges for future research in dexterous hand robot learning. Through extensive empirical analysis, the authors found that current policies struggle with tasks that require long-horizon execution, bimanual coordination, and tool-use, and that domain randomization is essential for assessing the robustness of policies. Overall, the paper provides a comprehensive benchmark and toolkit for task-oriented dexterous manipulation, which can be used to evaluate and improve the capabilities of robotic hands in complex object interactions.
🔥 InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation
📅 Published on May 14
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.14333
• PDF: https://arxiv.org/pdf/2605.14333
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#AutoregressiveImageGeneration #DiscreteTokenization #FaceReconstruction #TextReconstruction #VisualTokenization
💡 The paper proposes InsightTok, a new discrete visual tokenization framework that improves the quality of autoregressive image generation, particularly for text and face reconstruction. Current discrete tokenization methods often discard the fine-grained structures needed to preserve readable text and distinctive facial features because of aggressive downsampling and quantization. Standard discrete-tokenizer objectives are not well aligned with text legibility and facial fidelity: they optimize generic reconstruction while compressing diverse content uniformly.
To address this issue, the authors propose InsightTok, which uses localized, content-aware perceptual losses to enhance text and face fidelity. This approach allows the tokenizer to prioritize the preservation of important details in text and faces, resulting in better reconstruction quality. The InsightTok framework uses a compact 16k codebook and a 16x downsampling rate, which is relatively efficient compared to prior methods.
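The token budget implied by a 16k codebook and 16x downsampling can be worked out directly. The sketch assumes that "16k" means 16,384 entries, that the 16x factor applies per spatial side, and that the input is 256×256 pixels; the image and region sizes are illustrative and not from the paper.

```python
import math


def token_budget(height, width, downsample=16, codebook_size=16384):
    """Return (token count, bits per token, total bits) for one image or region."""
    grid_h, grid_w = height // downsample, width // downsample
    tokens = grid_h * grid_w
    bits_per_token = math.log2(codebook_size)
    return tokens, bits_per_token, tokens * bits_per_token


tokens, bits_per_token, total_bits = token_budget(256, 256)
region_tokens, _, _ = token_budget(32, 32)   # a small text or face region

print(f"{tokens} tokens x {bits_per_token:.0f} bits = {total_bits:.0f} bits per image")
print(f"a 32x32 px region is covered by only {region_tokens} tokens")
```

Under these assumptions a 32×32 pixel caption or face gets just four tokens, which illustrates why uniform compression tends to lose exactly the details that content-aware losses are meant to protect.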
The results show that InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. Furthermore, the gains achieved by InsightTok consistently transfer to autoregressive image generation, producing images with clearer text and more faithful facial details. The paper highlights the potential of specialized supervision in tokenizer training for advancing discrete image generation, demonstrating that a simple yet effective approach can lead to significant improvements in image generation quality. Overall, the InsightTok framework provides a new direction for improving the quality of autoregressive image generation, particularly for applications where text and face reconstruction are critical.
🔥 dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model
📅 Published on Dec 2, 2025
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2512.02498
• PDF: https://arxiv.org/pdf/2512.02498
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#DocumentLayoutParsing #VisionLanguageModels #MultilingualOCR #RelationalUnderstanding #EndToEndLearning
💡 The paper introduces dots.ocr, a unified Vision-Language Model that achieves state-of-the-art performance on document layout parsing by jointly learning layout detection, text recognition, and relational understanding. Current methods for document layout parsing rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training.
The proposed model addresses this by learning the three core tasks within a single, end-to-end framework. This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, enabling the model to deliver robust performance across a wide array of tasks, languages, layouts, and domains.
The model is validated on the OmniDocBench and XDocParse benchmarks; the latter is a new, challenging benchmark introduced in the paper that spans 126 languages. The results show that dots.ocr establishes a powerful new baseline, outperforming the next-best competitor by a 7.4 point margin and demonstrating its multilingual capabilities.
The paper's contributions include a unified Vision-Language Model for document layout parsing, a new benchmark for multilingual document intelligence, and a demonstration of the advantages of jointly learning layout detection, text recognition, and relational understanding within a single model.
🔥 Auditing Agent Harness Safety
📅 Published on May 14
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.14271
• PDF: https://arxiv.org/pdf/2605.14271
• Project Page: https://harnessaudit.github.io/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#LanguageModelSafety #AgentHarnessSecurity #ExecutionTrajectoryAudit #SafetyConstraintEvaluation #HarnessComplianceAssessment
💡 The paper Auditing Agent Harness Safety addresses the issue of ensuring safety constraints are met during the execution of large language model agents within execution harnesses. These agents can produce correct outputs while violating safety constraints during execution, which cannot be detected by evaluating only the final output. The authors propose a framework called HarnessAudit, which audits the full execution trajectory of agents across three dimensions: boundary compliance, execution fidelity, and system stability. They also introduce a benchmark called HarnessAudit-Bench, consisting of 210 tasks across eight real-world domains, to evaluate the safety of agent harnesses.
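The gap between output-level and trajectory-level evaluation can be shown with a small sketch. It covers only a boundary-compliance style check, and the `Step` record, the allow-list format, and the example paths are hypothetical; this is not the HarnessAudit implementation.

```python
from dataclasses import dataclass


@dataclass
class Step:
    agent: str
    action: str      # e.g. "read" or "write"
    resource: str


def audit_trajectory(steps, allowed):
    """Return (index, step) for every step touching a resource outside the agent's allow-list."""
    violations = []
    for index, step in enumerate(steps):
        if step.resource not in allowed.get(step.agent, set()):
            violations.append((index, step))
    return violations


# Hypothetical two-agent run: the task succeeds, but one step crosses a boundary.
trajectory = [
    Step("planner", "read", "/workspace/task.md"),
    Step("coder", "read", "/home/user/.ssh/id_rsa"),
    Step("coder", "write", "/workspace/out.txt"),
]
allowed = {
    "planner": {"/workspace/task.md"},
    "coder": {"/workspace/out.txt"},
}

final_output_ok = trajectory[-1].resource == "/workspace/out.txt"
violations = audit_trajectory(trajectory, allowed)
```

Here `final_output_ok` is true while `violations` is non-empty, which is the situation the summary describes: a check on the final output alone would report a clean run.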
The authors evaluate ten harness configurations across different models and frameworks and find that task completion does not guarantee safe execution, and safety violations accumulate as the execution trajectory length increases. They also find that safety risks vary across domains, task types, and agent roles, with most violations occurring in resource access and inter-agent information transfer. Additionally, they discover that multi-agent collaboration increases the safety risk surface, while harness design sets the upper bound of safe deployment.
The paper's contributions include the development of the HarnessAudit framework and the HarnessAudit-Bench benchmark, which provide a comprehensive approach to auditing agent harness safety. The results highlight the importance of trajectory-level auditing and the need for careful harness design to ensure safe deployment of agent harnesses, particularly in multi-agent systems. Overall, the paper provides a significant step towards ensuring the safety and reliability of large language model agents in real-world applications.
🔥 Unlocking Dense Metric Depth Estimation in VLMs
📅 Published on May 15
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15876
• PDF: https://arxiv.org/pdf/2605.15876
• Project Page: https://depthvlm.github.io/
🤖 Models citing this paper:
• https://huggingface.co/JonnyYu828/DepthVLM-4B
📊 Datasets citing this paper:
• https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VisionLanguageModels #DenseMetricDepthEstimation #DepthEstimationInVLMs #GeometryPrediction #VisionTextSupervision
💡 The paper proposes DepthVLM, a framework that enhances Vision-Language Models with dense geometry prediction capabilities. Vision-Language Models are limited in 3D understanding due to their text-only supervision paradigm, which prevents the recovery of dense geometry. Prior methods have limitations such as error accumulation or inefficient prediction.
DepthVLM addresses this by attaching a lightweight depth head to the model backbone and training it under a unified vision-text supervision paradigm with a two-stage schedule. This allows the model to generate full-resolution depth maps alongside language outputs in a single forward pass. The authors also introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format.
The results show that DepthVLM significantly outperforms existing Vision-Language Models, surpasses leading pure vision models, and improves complex 3D spatial reasoning, making it a step toward a truly unified foundation model. The code and checkpoints will be publicly released, making it accessible for further research and development. Overall, DepthVLM provides a simple yet effective solution for dense metric depth estimation in Vision-Language Models, unlocking their potential for 3D understanding and spatial reasoning.
Forwarded from Machine Learning with Python
🙏💸 500$ FOR THE FIRST 500 WHO JOIN THE CHANNEL! 🙏💸
Join our channel today for free! Tomorrow it will cost 500$!
https://xn--r1a.website/+-WZeIeP8YI8wM2E6
You can join at this link! 👆👇
https://xn--r1a.website/+-WZeIeP8YI8wM2E6
🔥 Lance: Unified Multimodal Modeling by Multi-Task Synergy
📅 Published on May 18
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18678
• PDF: https://arxiv.org/pdf/2605.18678
• Project Page: https://lance-project.github.io/
🤖 Models citing this paper:
• https://huggingface.co/bytedance-research/Lance
🚀 Spaces citing this paper:
• https://huggingface.co/spaces/Nayefleb/Lance
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#MultimodalModeling #MultitaskLearning #DualStreamArchitecture #MixtureOfExperts #UnifiedModelingApproach
💡 The paper introduces Lance, a unified multimodal model that combines understanding, generation, and editing capabilities for images and videos. The goal is to develop a model that can handle multiple tasks without relying on large model capacity or focusing on specific modalities like text or images. Lance achieves this through a dual-stream architecture and collaborative multi-task training, which enables joint context learning while separating the pathways for understanding and generation.
The model uses a mixture-of-experts architecture on shared multimodal sequences, allowing it to learn from both images and videos simultaneously. To address interference among different visual tokens, the model employs modality-aware rotary positional encoding, which helps to align tasks across different modalities.
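One simple way to make rotary positional encoding modality-aware is to shift each modality into its own position range before applying the standard rotation. The sketch below shows that idea only; the offset table and the function names are assumptions, and the scheme Lance actually uses may differ.

```python
import math

# Hypothetical per-modality position offsets (not taken from the paper).
MODALITY_OFFSET = {"text": 0, "image": 1000, "video": 2000}


def rope(vec, position, base=10000.0):
    """Standard rotary embedding: rotate each (even, odd) pair by position * frequency."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def modality_aware_rope(vec, position, modality):
    """Apply RoPE after shifting the position into the modality's own range."""
    return rope(vec, position + MODALITY_OFFSET[modality])


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same modality: the score depends only on the relative distance (2 positions here).
same_modality = dot(modality_aware_rope(q, 5, "image"), modality_aware_rope(k, 3, "image"))
plain = dot(rope(q, 5), rope(k, 3))
# Different modalities: the offset changes the relative angle and therefore the score.
cross_modality = dot(modality_aware_rope(q, 5, "image"), modality_aware_rope(k, 3, "text"))
```

Within a modality the attention score is unchanged by the offset, while tokens from different modalities see each other at a large relative distance, which is one plausible way to reduce interference between visual token types.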
During training, Lance uses a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling. This approach strengthens both semantic comprehension and visual generation performance. The results show that Lance outperforms existing unified models in image and video generation while maintaining strong multimodal understanding capabilities.
Overall, Lance presents a practical approach to unified multimodal modeling, demonstrating that collaborative multi-task training and a dual-stream architecture can lead to improved performance in multiple tasks without requiring large model capacity. The model has the potential to be applied to various applications that require multimodal understanding, generation, and editing capabilities.
🔥 LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation
📅 Published on May 18
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18739
• PDF: https://arxiv.org/pdf/2605.18739
• Project Page: https://nvlabs.github.io/LongLive/LongLive2/
🤖 Models citing this paper:
• https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B
• https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S4
• https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S2
📊 Datasets citing this paper:
• https://huggingface.co/datasets/Efficient-Large-Model/LongLive2.0-Toy-Dataset
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#LongVideoGeneration #ParallelInfrastructure #NVFP4 #AutoregressiveTraining #DiffusionModeling
💡 The paper introduces LongLive-2.0, a parallel infrastructure for long video generation that addresses training and inference bottlenecks. The problem with existing methods is that they are slow and require a lot of memory, especially for long videos. To solve this, the authors propose a sequence-parallel autoregressive training method called Balanced SP, which pairs clean-history and noisy-target temporal chunks on each rank, enabling efficient teacher-forcing and reducing GPU memory cost.
The method also uses NVFP4 precision to accelerate GEMM computation during training. Additionally, the authors tune a diffusion model into a long, multi-shot, interactive autoregressive diffusion model, which can be converted to real-time generation with standalone LoRA weights. For inference, the authors enable W4A4 NVFP4 inference (4-bit weights and 4-bit activations), quantize the KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding.
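To give a feel for what 4-bit block quantization does to values, here is a simulated round trip in plain Python. It assumes the commonly described NVFP4 layout (FP4 E2M1 elements with one shared scale per block of 16 values) and simplifies it by keeping the scale as an ordinary Python float instead of an FP8 value; it is a numerical illustration, not the kernels used in the paper.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (sign handled separately).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumed block size sharing one scale


def quantize_block(block):
    """Return (scale, codes): codes are signed E2M1 values, scale maps them back."""
    max_abs = max(abs(x) for x in block)
    scale = max_abs / 6.0 if max_abs > 0 else 1.0  # largest value maps to 6.0
    codes = []
    for x in block:
        magnitude = min(E2M1_VALUES, key=lambda v: abs(abs(x) / scale - v))
        codes.append(math.copysign(magnitude, x))
    return scale, codes


def dequantize_block(scale, codes):
    return [scale * c for c in codes]


def fake_quantize(values):
    """Quantize and immediately dequantize, block by block, to expose the rounding error."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


values = [i / 4 for i in range(-8, 8)]   # 16 toy values in [-2.0, 1.75]
restored = fake_quantize(values)
max_error = max(abs(a - b) for a, b in zip(values, restored))
```

Under these assumptions each value costs 4 bits plus a share of the per-block scale, which is where the memory savings for weights, activations, and the KV cache come from; the price is the rounding error visible in `max_error`.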
The results show that LongLive-2.0 achieves up to 2.15x speedup in training and 1.84x in inference. The LongLive-2.0-5B model achieves 45.7 FPS inference while attaining strong performance on benchmarks. The authors claim that LongLive-2.0 is the first NVFP4 training and inference system for long video generation, making it a significant contribution to the field. Overall, the paper presents a novel parallel infrastructure that addresses the speed and memory bottlenecks in long video generation, making it possible to generate high-quality videos in real time.