AI & ML Papers

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

🔥 FashionChameleon: Towards Real-Time and Interactive Human-Garment Video Customization

💡 The paper introduces FashionChameleon, a real-time and interactive framework for human-garment video customization in autoregressive video generation. The problem addressed is the inability of existing approaches to support low-latency and interactive garment control, which is crucial for applications such as e-commerce and content creation.

To solve this problem, the authors propose a method that consists of three key techniques. First, they train a Teacher Model with In-Context Learning on a single reference-garment pair, which encourages the model to implicitly preserve coherence during single-garment switching. Second, they introduce Streaming Distillation with In-Context Learning, which fine-tunes the model with in-context teacher forcing and improves extrapolation consistency via gradient-reweighted distribution matching distillation. Third, they propose Training-Free KV Cache Rescheduling, which includes garment KV refresh, historical KV withdraw, and reference KV disentangle to achieve garment switching while preserving motion coherence.
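To make the three cache operations named above more concrete, here is a toy Python sketch. It is not the authors' implementation: the class `ToyKVCache`, its three segments, and the meaning given to each operation are assumptions inferred only from the operation names in the summary, and the entries are plain labels rather than key/value tensors.

```python
from dataclasses import dataclass, field


@dataclass
class ToyKVCache:
    """Hypothetical cache with three separate segments (labels stand in for tensors)."""
    reference: list = field(default_factory=list)  # reference entries, stored apart
    garment: list = field(default_factory=list)    # entries for the current garment
    history: list = field(default_factory=list)    # entries for already generated frames

    def append_frame(self, frame_id: int) -> None:
        # Autoregressive step: every new frame adds an entry to the history segment.
        self.history.append(f"frame{frame_id}")

    def refresh_garment(self, garment_name: str) -> None:
        # "Garment KV refresh" (assumed meaning): overwrite the garment segment only.
        self.garment = [f"garment:{garment_name}"]

    def withdraw_history(self, keep_last: int) -> None:
        # "Historical KV withdraw" (assumed meaning): drop older frame entries that
        # still carry the old garment, keeping a few recent ones for motion continuity.
        self.history = self.history[-keep_last:] if keep_last > 0 else []

    def context(self) -> list:
        # "Reference KV disentangle" (assumed meaning): the reference segment is kept
        # separate, so a garment switch never modifies it.
        return self.reference + self.garment + self.history


cache = ToyKVCache(reference=["ref:person"], garment=["garment:red_dress"])
for t in range(4):
    cache.append_frame(t)

# Interactive garment switch in the middle of generation.
cache.refresh_garment("blue_jacket")
cache.withdraw_history(keep_last=2)
print(cache.context())
```

The point of the sketch is only that a switch touches the garment and history segments and leaves the reference segment alone, with no retraining step involved.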

The results show that FashionChameleon uniquely supports interactive customization and consistent long-video extrapolation while generating in real time at 23.8 FPS on a single GPU, which is 30-180 times faster than existing baselines. Users can switch garments interactively during generation, which gives the framework commercial value in e-commerce and content creation.
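As a quick sanity check on the reported numbers, the short Python calculation below converts 23.8 FPS into a per-frame time and derives the baseline speeds implied by a 30-180x speedup. It assumes the speedup is measured in frames per second; the summary does not say how it was measured.

```python
fps = 23.8                            # reported generation speed
speedup_low, speedup_high = 30, 180   # reported speedup range over baselines

# Time budget per frame at the reported speed.
frame_ms = 1000.0 / fps

# Baseline speeds implied by the speedup range (assumption: speedup is in FPS).
baseline_fps_fast = fps / speedup_low
baseline_fps_slow = fps / speedup_high

print(f"per-frame time: {frame_ms:.1f} ms")
print(f"implied baseline speed: {baseline_fps_slow:.2f} to {baseline_fps_fast:.2f} FPS")
print(f"implied baseline time per frame: {1 / baseline_fps_fast:.2f} to {1 / baseline_fps_slow:.2f} s")
```

Under that assumption, the baselines would need roughly one to several seconds per frame, which is why they cannot respond to a garment switch interactively.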

📅 Published on May 15

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15824
• PDF: https://arxiv.org/pdf/2605.15824
• Project Page: https://quanjiansong.github.io/projects/FashionChameleon/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#RealTimeVideoCustomization #HumanGarmentInteraction #AutoregressiveVideoGeneration #InteractiveGarmentControl #EcommerceVideoTechnology

🔥 ReactiveGWM: Steering NPC in Reactive Game World Models

💡 Current game world models are limited: they simulate environments from a player-centric perspective and treat non-player characters as background elements, so they fail to capture interactions between the player and the non-player character. The resulting models lack physical understanding and cannot simulate action-induced non-player character reactions.

The paper introduces ReactiveGWM, a reactive game world model that synthesizes dynamic interactions between the player and the non-player character by decoupling player controls from non-player character behaviors. It builds on a diffusion model with cross-attention modules that learn a game-agnostic representation of interactive logic, allowing zero-shot strategy transfer across different games.

In the proposed method, player actions are injected into the diffusion backbone via a lightweight additive bias, while high-level non-player character responses are grounded through cross-attention modules. Because the learned representation of interactive logic is game-agnostic, it can be transferred to other games without domain-specific retraining.
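The two conditioning paths described above can be illustrated with a minimal pure-Python sketch. This is not the paper's architecture: the token values, the two-dimensional embeddings, and the function names are hypothetical, and the learned query/key/value projections of a real cross-attention layer are omitted for brevity.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def add_action_bias(frame_tokens, action_bias):
    """Additive bias: the same action vector is added to every frame token."""
    return [[x + b for x, b in zip(tok, action_bias)] for tok in frame_tokens]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    Queries come from frame tokens; keys and values come from prompt tokens.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


# Hypothetical toy inputs.
frame_tokens = [[1.0, 0.0], [0.0, 1.0]]     # latent frame tokens
action_bias = [0.5, -0.5]                   # embedding of the player's input
prompt_tokens = [[1.0, 0.0], [0.0, 1.0]]    # embedding of the NPC strategy prompt

biased = add_action_bias(frame_tokens, action_bias)               # player control path
npc_context = cross_attention(biased, prompt_tokens, prompt_tokens)  # NPC strategy path
print(biased)
print(npc_context)
```

Keeping the player signal as a bias and the NPC signal as attention context means the two inputs enter through different routes, which is one simple way to read "decoupling player controls from non-player character behaviors".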

The results show that ReactiveGWM maintains fine-grained player controllability while achieving robust, prompt-aligned non-player character strategy adherence. The model is evaluated on two Street Fighter games, demonstrating that it unlocks steerable non-player character interactions without domain-specific retraining. Overall, the paper contributes an approach to simulating dynamic interactions between players and non-player characters in game worlds, paving the way for scalable, strategy-rich interactions with non-player characters.

📅 Published on May 14

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15256
• PDF: https://arxiv.org/pdf/2605.15256
• Project Page: https://inv-wzq.github.io/ReactiveGWM/

🤖 Models citing this paper:
• https://huggingface.co/INV-WZQ/ReactiveGWM-Models

📊 Datasets citing this paper:
• https://huggingface.co/datasets/INV-WZQ/ReactiveGWM-Datasets

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#GameWorldModels #ReactiveGameDevelopment #NPCAI #GamePhysicsSimulation #ReactiveGameWorldModeling

🔥 DexJoCo: A Benchmark and Toolkit for Task-Oriented Dexterous Manipulation on MuJoCo

💡 The paper presents DexJoCo, a benchmark and toolkit for task-oriented dexterous manipulation, which aims to advance the capabilities of robotic hands in complex object interactions. The problem addressed is the lack of standardized benchmarks for evaluating dexterous manipulation, with existing benchmarks lacking tasks that reflect the unique capabilities of dexterous hands. To address this, the authors developed DexJoCo, which comprises 11 functional tasks that evaluate tool-use, bimanual coordination, long-horizon execution, and reasoning.

The method used to achieve this involves developing a low-cost data collection system, which collected 1.1K trajectories across these tasks, with support for domain randomization to assess robustness. The authors also benchmarked modern models under diverse settings, including visual and dynamics randomization, multi-task training, and action-head adaptation.
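Domain randomization of the kind mentioned above can be sketched in a few lines of Python. The parameter names, nominal values, and sampling ranges below are hypothetical and are not taken from DexJoCo; the sketch only shows the idea of resampling visual and dynamics parameters per episode, with either group switchable independently.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class EpisodeConfig:
    light_intensity: float   # visual: relative brightness
    camera_yaw_deg: float    # visual: camera offset in degrees
    friction_scale: float    # dynamics: multiplier on nominal friction
    mass_scale: float        # dynamics: multiplier on nominal object mass


# Hypothetical nominal (non-randomized) setting.
NOMINAL = EpisodeConfig(1.0, 0.0, 1.0, 1.0)


def sample_config(rng: random.Random, visual: bool = True,
                  dynamics: bool = True) -> EpisodeConfig:
    """Sample one episode configuration; disabled groups keep nominal values."""
    return EpisodeConfig(
        light_intensity=rng.uniform(0.5, 1.5) if visual else NOMINAL.light_intensity,
        camera_yaw_deg=rng.uniform(-10.0, 10.0) if visual else NOMINAL.camera_yaw_deg,
        friction_scale=rng.uniform(0.7, 1.3) if dynamics else NOMINAL.friction_scale,
        mass_scale=rng.uniform(0.8, 1.2) if dynamics else NOMINAL.mass_scale,
    )


rng = random.Random(0)  # fixed seed so evaluation episodes are reproducible
for episode in range(3):
    print(episode, sample_config(rng, visual=True, dynamics=False))
```

Evaluating the same policy with `visual` and `dynamics` toggled separately is one way to see which kind of shift hurts it more.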

Through extensive empirical analysis, the authors identify several insights and common limitations of current dexterous manipulation policies, highlighting key challenges for future research in dexterous-hand robot learning. They find that current policies struggle with tasks that require long-horizon execution, bimanual coordination, and tool use, and that domain randomization is essential for assessing policy robustness. Overall, the paper provides a benchmark and toolkit for evaluating and improving the capabilities of robotic hands in complex object interactions.

📅 Published on May 15

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.16257
• PDF: https://arxiv.org/pdf/2605.16257
• Project Page: https://dexjoco.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DexterousManipulation #TaskOrientedRobotics #MuJoCoBenchmark #RoboticHandControl #BimanualCoordination

🔥 InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

💡 The InsightTok paper proposes a new discrete visual tokenization framework to improve the quality of autoregressive image generation, particularly for text and face reconstruction. The problem addressed is that current discrete tokenization methods often discard the fine-grained structures needed to preserve readable text and distinctive facial features, because of aggressive downsampling and quantization. The underlying cause is that standard discrete-tokenizer objectives are not well aligned with text legibility and facial fidelity: they optimize generic reconstruction while compressing diverse content uniformly.

To address this issue, the authors propose InsightTok, which uses localized, content-aware perceptual losses to enhance text and face fidelity. This approach allows the tokenizer to prioritize the preservation of important details in text and faces, resulting in better reconstruction quality. The InsightTok framework uses a compact 16k codebook and a 16x downsampling rate, which is relatively efficient compared to prior methods.
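To give a feel for what a 16k codebook at 16x downsampling means, the following Python calculation works out the token budget for one image. It assumes that "16k" means 16,384 entries, that each spatial dimension is downsampled by 16, and that the input is a 256x256 RGB image; none of these specifics are confirmed by the summary.

```python
import math


def token_budget(height, width, downsample=16, codebook_size=16384):
    """Return (number of tokens, total bits) for one image under the stated assumptions."""
    grid_h, grid_w = height // downsample, width // downsample
    num_tokens = grid_h * grid_w
    bits_per_token = math.log2(codebook_size)  # bits needed to index one codebook entry
    return num_tokens, num_tokens * bits_per_token


tokens, bits = token_budget(256, 256)
raw_bytes = 256 * 256 * 3                      # uncompressed 8-bit RGB
print(f"tokens per image: {tokens}")
print(f"bits per image:   {bits:.0f} ({bits / 8:.0f} bytes)")
print(f"compression vs raw RGB: about {raw_bytes / (bits / 8):.0f}x")
```

With so few bits per image, small glyph strokes and facial details compete with everything else for capacity, which is the motivation for weighting text and face regions more heavily during tokenizer training.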

The results show that InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. Furthermore, the gains achieved by InsightTok consistently transfer to autoregressive image generation, producing images with clearer text and more faithful facial details. The paper highlights the potential of specialized supervision in tokenizer training for advancing discrete image generation, demonstrating that a simple yet effective approach can lead to significant improvements in image generation quality. Overall, the InsightTok framework provides a new direction for improving the quality of autoregressive image generation, particularly for applications where text and face reconstruction are critical.

📅 Published on May 14

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.14333
• PDF: https://arxiv.org/pdf/2605.14333

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AutoregressiveImageGeneration #DiscreteTokenization #FaceReconstruction #TextReconstruction #VisualTokenization

🔥 dots.ocr: Multilingual Document Layout Parsing in a Single Vision-Language Model

💡 The paper introduces dots.ocr, a unified Vision-Language Model that achieves state-of-the-art performance on document layout parsing by jointly learning layout detection, text recognition, and relational understanding. Current methods for document layout parsing rely on fragmented, multi-stage pipelines that suffer from error propagation and fail to leverage the synergies of joint training. dots.ocr instead learns the three core tasks within a single end-to-end framework.

This is made possible by a highly scalable data engine that synthesizes a vast multilingual corpus, enabling the model to deliver robust performance across a wide array of tasks, languages, layouts, and domains.

The model is validated on OmniDocBench and on XDocParse, a challenging new benchmark introduced in the paper that spans 126 languages. The results show that dots.ocr establishes a strong new baseline, outperforming the next-best competitor by a 7.4-point margin and demonstrating its multilingual capabilities. The paper's contributions are the unified model, the new benchmark for multilingual document intelligence, and a demonstration of the advantages of jointly learning layout detection, text recognition, and relational understanding within a single model.

📅 Published on Dec 2, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2512.02498
• PDF: https://arxiv.org/pdf/2512.02498

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DocumentLayoutParsing #VisionLanguageModels #MultilingualOCR #RelationalUnderstanding #EndToEndLearning

🔥 Auditing Agent Harness Safety

💡 The paper Auditing Agent Harness Safety addresses the issue of ensuring safety constraints are met during the execution of large language model agents within execution harnesses. These agents can produce correct outputs while violating safety constraints during execution, which cannot be detected by evaluating only the final output. The authors propose a framework called HarnessAudit, which audits the full execution trajectory of agents across three dimensions: boundary compliance, execution fidelity, and system stability. They also introduce a benchmark called HarnessAudit-Bench, consisting of 210 tasks across eight real-world domains, to evaluate the safety of agent harnesses.
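The gap between output-level and trajectory-level evaluation can be shown with a small Python sketch. It is not the HarnessAudit implementation: the `Step` record, the per-agent allowlist, and the example paths are hypothetical, and the check covers only a simplified form of boundary compliance, not execution fidelity or system stability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    agent: str      # which agent acted
    action: str     # e.g. "read" or "write"
    resource: str   # what it touched


def audit_trajectory(steps, allowed):
    """Return every step that touches a resource outside the acting agent's allowlist."""
    return [s for s in steps if s.resource not in allowed.get(s.agent, set())]


# Hypothetical run: the task succeeds, but one intermediate step is out of bounds.
trajectory = [
    Step("planner", "read", "/workspace/task.md"),
    Step("coder", "read", "/home/user/.ssh/id_rsa"),
    Step("coder", "write", "/workspace/solution.py"),
]
allowed = {
    "planner": {"/workspace/task.md"},
    "coder": {"/workspace/task.md", "/workspace/solution.py"},
}

final_output_correct = True                    # what an output-only check would see
violations = audit_trajectory(trajectory, allowed)

print("output-only verdict:", "pass" if final_output_correct else "fail")
print("trajectory verdict: ", "pass" if not violations else "fail")
for v in violations:
    print("  violation:", v.agent, v.action, v.resource)
```

Because every step is checked, a longer trajectory has more chances to fail the audit, which matches the finding that violations accumulate with trajectory length.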

The authors evaluate ten harness configurations across different models and frameworks and find that task completion does not guarantee safe execution, and safety violations accumulate as the execution trajectory length increases. They also find that safety risks vary across domains, task types, and agent roles, with most violations occurring in resource access and inter-agent information transfer. Additionally, they discover that multi-agent collaboration increases the safety risk surface, while harness design sets the upper bound of safe deployment.

The paper's contributions include the development of the HarnessAudit framework and the HarnessAudit-Bench benchmark, which provide a comprehensive approach to auditing agent harness safety. The results highlight the importance of trajectory-level auditing and the need for careful harness design to ensure safe deployment of agent harnesses, particularly in multi-agent systems. Overall, the paper provides a significant step towards ensuring the safety and reliability of large language model agents in real-world applications.

📅 Published on May 14

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.14271
• PDF: https://arxiv.org/pdf/2605.14271
• Project Page: https://harnessaudit.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LanguageModelSafety #AgentHarnessSecurity #ExecutionTrajectoryAudit #SafetyConstraintEvaluation #HarnessComplianceAssessment

🔥 Unlocking Dense Metric Depth Estimation in VLMs

💡 The paper proposes DepthVLM, a framework that enhances Vision-Language Models with dense geometry prediction capabilities. Vision-Language Models are limited in 3D understanding due to their text-only supervision paradigm, which prevents the recovery of dense geometry. Prior methods have limitations such as error accumulation or inefficient prediction.

DepthVLM addresses this by attaching a lightweight depth head to the model backbone and training it under a unified vision-text supervision paradigm with a two-stage schedule. This allows the model to generate full-resolution depth maps alongside language outputs in a single forward pass. The authors also introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format.

The results show that DepthVLM significantly outperforms existing Vision-Language Models, surpasses leading pure vision models, and improves complex 3D spatial reasoning, making it a step toward a truly unified foundation model. The code and checkpoints will be publicly released, making it accessible for further research and development. Overall, DepthVLM provides a simple yet effective solution for dense metric depth estimation in Vision-Language Models, unlocking their potential for 3D understanding and spatial reasoning.

📅 Published on May 15

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15876
• PDF: https://arxiv.org/pdf/2605.15876
• Project Page: https://depthvlm.github.io/

🤖 Models citing this paper:
• https://huggingface.co/JonnyYu828/DepthVLM-4B

📊 Datasets citing this paper:
• https://huggingface.co/datasets/JonnyYu828/DepthVLM-Bench

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #DenseMetricDepthEstimation #DepthEstimationInVLMs #GeometryPrediction #VisionTextSupervision

🔥 Lance: Unified Multimodal Modeling by Multi-Task Synergy

💡 The paper introduces Lance, a unified multimodal model that combines understanding, generation, and editing capabilities for images and videos. The goal is to develop a model that can handle multiple tasks without relying on large model capacity or focusing on specific modalities like text or images. Lance achieves this through a dual-stream architecture and collaborative multi-task training, which enables joint context learning while separating the pathways for understanding and generation.

The model uses a mixture-of-experts architecture on shared multimodal sequences, allowing it to learn from both images and videos simultaneously. To address interference among different visual tokens, the model employs modality-aware rotary positional encoding, which helps to align tasks across different modalities.

During training, Lance uses a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling. This approach strengthens both semantic comprehension and visual generation performance. The results show that Lance outperforms existing unified models in image and video generation while maintaining strong multimodal understanding capabilities.

Overall, Lance presents a practical approach to unified multimodal modeling, demonstrating that collaborative multi-task training and a dual-stream architecture can lead to improved performance in multiple tasks without requiring large model capacity. The model has the potential to be applied to various applications that require multimodal understanding, generation, and editing capabilities.

📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18678
• PDF: https://arxiv.org/pdf/2605.18678
• Project Page: https://lance-project.github.io/

🤖 Models citing this paper:
• https://huggingface.co/bytedance-research/Lance

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/Nayefleb/Lance

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalModeling #MultitaskLearning #DualStreamArchitecture #MixtureOfExperts #UnifiedModelingApproach
