AI & ML Papers

The AI community building the future. Hugging Face has 458 repositories available. Follow their code on GitHub.

🔥 Masked Visual Actions for Unified World Modeling

💡 The paper introduces Masked Visual Actions, a method for unified world modeling. Video models learn how the visual world moves, interacts, and responds to contact, which makes them promising substrates for robotic world modeling. The central challenge is how to communicate action to such models in a form that is aligned with the visual space in which they learned their interaction priors, yet still grounded in physical manipulation.

The proposed method, Masked Visual Actions, expresses action as a partially revealed trajectory of an arbitrary entity in a video, using a pixel-space control interface. This allows the model to act as a forward dynamics model that predicts the scene's response to low-level robot actions, while also recovering robot behavior consistent with a desired outcome.
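
One way to picture the masking idea, as a minimal sketch: assume each entity's motion is a list of per-frame (x, y) pixel positions, and that hiding a frame means replacing its position with None. The entity names, the `reveal` helper, and the None sentinel are hypothetical illustrations, not the paper's interface. Revealing the robot and hiding the object reads as a forward-dynamics query; the reverse reads as an inverse query.

```python
from typing import Dict, List, Optional, Set, Tuple

Point = Tuple[int, int]
Track = List[Optional[Point]]


def reveal(tracks: Dict[str, List[Point]], visible: Dict[str, Set[int]]) -> Dict[str, Track]:
    """Keep only the frames listed in `visible`; hide every other frame as None."""
    return {
        name: [p if t in visible.get(name, set()) else None for t, p in enumerate(points)]
        for name, points in tracks.items()
    }


# Hypothetical pixel-space trajectories for two entities over three frames.
tracks = {
    "robot_gripper": [(10, 40), (14, 40), (18, 41)],
    "cup": [(30, 42), (30, 42), (33, 42)],
}
all_frames = {0, 1, 2}

# Forward query: the robot's motion is given, the cup's response is hidden.
forward_query = reveal(tracks, {"robot_gripper": all_frames, "cup": {0}})

# Inverse query: the desired cup motion is given, the robot's motion is hidden.
inverse_query = reveal(tracks, {"robot_gripper": {0}, "cup": all_frames})
```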

The model is fine-tuned on only 15 hours of masked examples from real videos and simulation, and achieves strong visual fidelity and controllability across diverse scenes and multiple embodiments. It produces imagined rollouts whose outcomes correlate with real-world execution for policy evaluation, improves decision-making by ranking candidate futures in model-based planning, and supports inverse modeling by synthesizing robot motion from desired object motion.
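
Ranking candidate futures can be sketched generically as follows. This is a toy under stated assumptions, not the paper's planner: `rank_candidates`, `toy_rollout`, and `toy_score` are hypothetical names, and the "world model" here is a one-dimensional stand-in that just sums the actions.

```python
from typing import Callable, List, Sequence


def rank_candidates(
    candidates: Sequence[List[float]],
    rollout: Callable[[List[float]], float],
    score: Callable[[float], float],
) -> List[List[float]]:
    """Imagine each candidate's outcome with `rollout`, then sort best-first by `score`."""
    return sorted(candidates, key=lambda actions: score(rollout(actions)), reverse=True)


# Toy stand-ins (assumptions): the state is a single number and the goal is 3.0.
start, goal = 0.0, 3.0


def toy_rollout(actions: List[float]) -> float:
    return start + sum(actions)


def toy_score(final_state: float) -> float:
    return -abs(final_state - goal)  # closer to the goal scores higher


candidates = [[1.0, 1.0], [1.0, 2.0], [2.0, 2.0]]
ranked = rank_candidates(candidates, toy_rollout, toy_score)
best = ranked[0]
```

In a real system the rollout would be a video prediction and the score would compare the imagined outcome with the task goal; only the rank-then-pick structure carries over.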

The contributions of the paper include a novel method for communicating action to video models, a pixel-space control interface for expressing action, and a model that can predict scene responses to robot actions and recover robot behavior consistent with desired outcomes. The results demonstrate the effectiveness of the method in achieving strong visual fidelity and controllability, and its potential applications in downstream manipulation settings, such as policy evaluation, model-based planning, and inverse modeling.

📅 Published on Jul 21

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2607.19343
• PDF: https://arxiv.org/pdf/2607.19343
• Project Page: https://masked-visual-actions.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MaskedVisualActions #UnifiedWorldModeling #VideoModels #RoboticWorldModeling #ForwardDynamicsModel

🔥 Generative World Renderer at the Speed of Play

💡 The paper introduces Alaya Renderer Flash, a real-time generative forward world renderer that significantly improves the speed of the original Alaya Renderer. The original Alaya Renderer was too computationally expensive for real-time deployment, running at only 0.56 frames per second. In contrast, Alaya Renderer Flash achieves 31.54 frames per second, making it suitable for interactive world modeling and user-controllable play.
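
The two frame rates quoted above imply the following speedup and per-frame times. This is plain arithmetic on the reported numbers; the 30 fps budget line is included only for comparison.

```python
teacher_fps = 0.56   # original Alaya Renderer, as reported
flash_fps = 31.54    # Alaya Renderer Flash, as reported

speedup = flash_fps / teacher_fps      # about 56x
teacher_ms = 1000.0 / teacher_fps      # about 1786 ms per frame
flash_ms = 1000.0 / flash_fps          # about 31.7 ms per frame
budget_30fps_ms = 1000.0 / 30.0        # about 33.3 ms available per frame at 30 fps

print(f"speedup: {speedup:.1f}x")
print(f"per frame: {teacher_ms:.0f} ms -> {flash_ms:.1f} ms (30 fps budget: {budget_30fps_ms:.1f} ms)")
```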

The key contribution of Alaya Renderer Flash is its reformulation of the original renderer as a few-step autoregressive streaming model, which enables efficient latent encoding and frame reconstruction. This approach preserves the scene structure without altering the underlying world dynamics, unlike models that generate frames from text or control hints. Additionally, Alaya Renderer Flash retains the teacher model's G-buffer and text-prompt interfaces while enabling continuous rendering over input streams of unbounded length.

The authors evaluate Alaya Renderer Flash on G-buffer streams across various metrics, including content preservation, temporal consistency, cross-window stability, prompt controllability, and runtime efficiency. The results show that Alaya Renderer Flash substantially reduces inference cost while preserving the core rendering capabilities of the teacher model. By integrating Alaya Renderer Flash with a physics engine, the authors build a fully playable generative world running at 30 frames per second, demonstrating the potential of this approach for real-time interactive applications. Overall, Alaya Renderer Flash offers a promising alternative path towards interactive world modeling and user-controllable play.

📅 Published on Jul 21

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2607.18703
• PDF: https://arxiv.org/pdf/2607.18703
• Project Page: https://alaya-renderer-flash.alayalab.ai/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#GenerativeRendering #RealTimeWorldModeling #AutoregressiveStreaming #ForwardRenderingTechniques #InteractiveWorldModeling

🔥 ABot-World-0: Infinite Interactive World Rollout on a Single Desktop GPU

💡 The paper presents ABot-World-0, a system for real-time, long-horizon, closed-loop interaction in a virtual world. The system is trained on a large dataset of videos, games, and simulation engines to learn controllable world dynamics. The authors propose a multi-source data infrastructure to collect and process data, and a unified pipeline to apply quality checks, assessment, and synchronization of actions and text annotations.

The system uses a teacher-forcing approach to train an action-conditioned video world model, which is then distilled into a causal student model using teacher forcing and ODE distillation. The authors also introduce Long Forcing, a method that aligns long student self-rollouts with an extended-horizon teacher to mitigate accumulated distribution shift and autoregressive drift.

The system provides a unified control interface for scene roaming and third-person character interaction, and uses reference-character memory to provide persistent appearance cues for identity consistency during third-person rollouts. The authors also co-design a streaming inference stack with a lightweight VAE decoder, efficient attention, memory-aware scheduling, and low-bit DiT inference.

The results show that ABot-World-0 can stream 720p video at up to 16 frames per second on a single NVIDIA RTX 5090 desktop GPU, with a 1.2-second action-to-first-frame latency and approximately 19 GB of peak VRAM. Experiments on the World Roam Benchmark and on extended interactive rollouts demonstrate competitive controllability and coherent long-horizon world evolution. Overall, the paper presents a novel approach to real-time, long-horizon, closed-loop interaction in virtual worlds, with potential applications in robotics, gaming, and simulation.

📅 Published on Jul 21

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2607.19191
• PDF: https://arxiv.org/pdf/2607.19191
• Project Page: https://abot-world.amap.com/

🤖 Models citing this paper:
• https://huggingface.co/acvlab/ABot-World-0-5B-LF

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/acvlab/abot-world-interactive

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VirtualWorldSimulation #InteractiveWorldModels #RealTimeWorldDynamics #ClosedLoopInteraction #ArtificialIntelligenceForGames

🔥 Self Gradient Forcing: Native Long Video Extrapolation

💡 The paper proposes a new method called Self Gradient Forcing for native long video extrapolation. Recent autoregressive video diffusion methods are built upon Self Forcing, where the student is trained on histories produced by its own rollout rather than ground-truth video contexts. However, this approach has a limitation known as the historical context-gradient gap, where future losses cannot supervise how earlier generated latents should be written into more useful keys and values for later video-latent generation.

To address this issue, the authors propose a two-pass training strategy called Self Gradient Forcing. The first pass performs a no-gradient autoregressive rollout matching inference and records both the self-generated context and the noisy latents fed to the model at a sampled denoising exit step. The second pass performs parallel context-gradient reconstruction for the recorded exit step. The generated context is used as a stop-gradient clean-latent input, while the model recomputes the context KV representations and future-to-context causal attention.
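
The context-gradient gap can be illustrated with a scalar toy. This is a sketch under strong simplifying assumptions, not the paper's training code: the "key" is a single number `w_k * context`, the future prediction is `w_q * key`, and gradients are taken by finite differences so no autodiff library is needed. All names are hypothetical. When the key is cached from the rollout, the future loss gives no gradient to `w_k`; when the key is recomputed from the stop-gradient context, it does.

```python
def write_key(w_k: float, context: float) -> float:
    """Toy 'memory write': encode a context latent into a key."""
    return w_k * context


def future_loss(key: float, w_q: float, target: float) -> float:
    """Toy loss on a future latent that reads the key."""
    return (w_q * key - target) ** 2


def numeric_grad(f, x: float, eps: float = 1e-6) -> float:
    """Central finite-difference derivative of f at x."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)


context = 2.0  # self-generated latent from the rollout, held fixed (stop-gradient)
w_k, w_q, target = 0.5, 1.5, 4.0

# Pass 1 analogue: the key was produced during a no-gradient rollout and cached.
cached_key = write_key(w_k, context)
grad_cached = numeric_grad(lambda w: future_loss(cached_key, w_q, target), w_k)

# Pass 2 analogue: the key is recomputed from the fixed context, so the
# future loss now depends on w_k and can supervise the memory write.
grad_recomputed = numeric_grad(
    lambda w: future_loss(write_key(w, context), w_q, target), w_k
)

print(grad_cached, grad_recomputed)
```

Analytically, the second gradient is 2 * (w_q * w_k * context - target) * w_q * context, which is nonzero here, while the first is zero because the cached key does not vary with `w_k`.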

The proposed method provides the missing memory-writing supervision within the native autoregressive training objective, using losses on future video latents to train the model to encode context into more effective causal memory. The authors evaluate their method across extensive long-horizon frame-wise and chunk-wise experiments under different initializations and achieve stronger native long-video extrapolation than Self Forcing, especially in subject identity, background/layout consistency, and temporal stability. Notably, using only a 5-second training window, Self Gradient Forcing can extrapolate to videos lasting several minutes.

📅 Published on Jul 22

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2607.20368
• PDF: https://arxiv.org/pdf/2607.20368
• Project Page: https://zhuang2002.github.io/SelfGradientForcing/

🤖 Models citing this paper:
• https://huggingface.co/JunhaoZhuang/Self_Gradient_Forcing

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VideoExtrapolation #AutoregressiveVideoDiffusion #SelfGradientForcing #LongVideoGeneration #VideoDiffusionMethods

🔥 Beyond Relevance-Centric Retrieval: Rubric-Oriented Document Set Selection and Ranking

💡 The paper addresses the issue of document set selection and ranking, which is crucial for large language models and AI agents that rely on search results. Existing evaluation systems score documents independently and aggregate them using metrics like DCG, ignoring interactions between documents such as redundancy, conflict, and complementarity. This limitation makes it difficult to determine what makes one document set better than another.
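
To see why independent scoring misses redundancy, consider the standard DCG formula, DCG = sum over ranks i of rel_i / log2(i + 1). The sketch below uses made-up document names and relevance grades; a list of three near-duplicates receives exactly the same DCG as a list of three distinct documents, because only the per-document grades enter the sum.

```python
import math
from typing import List, Tuple


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain with the log2(rank + 1) discount."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))


# Hypothetical ranked lists: (document id, relevance grade).
diverse: List[Tuple[str, int]] = [("doc_a", 3), ("doc_b", 3), ("doc_c", 2)]
redundant: List[Tuple[str, int]] = [("doc_a", 3), ("doc_a_copy1", 3), ("doc_a_copy2", 2)]

dcg_diverse = dcg([rel for _, rel in diverse])
dcg_redundant = dcg([rel for _, rel in redundant])

print(dcg_diverse, dcg_redundant)  # identical: DCG never looks at document overlap
```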

To address this issue, the authors propose a comprehensive evaluate-diagnose-optimize framework. They design Setwise Eval Kit, a three-level, nine-dimension document set evaluation benchmark that covers both short-form and long-form scenarios, comprising approximately 28,000 high-quality evaluation rubrics.

The authors systematically evaluate 12 rerankers and find that even the best method achieves no more than 45 percent coverage, and cross-document coordination dimensions are universally weak. No single method maintains top performance across both settings.

Building on this, the authors propose Rubric4Setwise, a training-free method that converts rubric-based evaluation criteria into document set selection signals. This method achieves the best downstream generation performance with fewer documents and search rounds. It is the only method that maintains state-of-the-art results across both scenarios, validating the effectiveness of closing the loop from evaluation to optimization.
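
As a rough illustration of how rubric coverage can drive set selection, here is a generic greedy coverage heuristic. It is not the Rubric4Setwise algorithm, whose details are not given in this summary; the function name, the document ids, and the rubric ids are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def greedy_select(doc_rubrics: Dict[str, Set[str]], budget: int) -> Tuple[List[str], Set[str]]:
    """Pick up to `budget` documents, each time taking the one that covers
    the most rubric items not yet covered by the documents already chosen."""
    remaining = dict(doc_rubrics)
    chosen: List[str] = []
    covered: Set[str] = set()
    while remaining and len(chosen) < budget:
        best = max(remaining, key=lambda d: len(remaining[d] - covered))
        if not remaining[best] - covered:
            break  # no remaining document adds anything new
        chosen.append(best)
        covered |= remaining.pop(best)
    return chosen, covered


# Hypothetical example: d1 and d2 are redundant, d3 is complementary.
doc_rubrics = {
    "d1": {"r1", "r2"},
    "d2": {"r1", "r2"},
    "d3": {"r3"},
}
all_rubrics = {"r1", "r2", "r3"}

chosen, covered = greedy_select(doc_rubrics, budget=2)
coverage = len(covered) / len(all_rubrics)
print(chosen, coverage)
```

A relevance-only ranker would likely return d1 and d2 together; the set-level view prefers d1 with d3 because the pair satisfies more rubric items.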

The paper's contributions include a comprehensive evaluation framework, a new benchmark for document set evaluation, and a novel method for document set selection and ranking that outperforms existing methods. The results demonstrate the importance of considering cross-document interactions and using rubric-based evaluation criteria to improve document set selection and ranking.

📅 Published on Jul 22

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2607.19747
• PDF: https://arxiv.org/pdf/2607.19747
• Project Page: https://rubric4setwise.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DocumentSetSelection #RubricOrientedRanking #InformationRetrieval #DocumentEvaluation #SetwiseOptimization

🔥 Color Pass-Through via Camera-Display Coupling

💡 The paper Color Pass-Through via Camera-Display Coupling addresses the issue of color discrepancy between the original scene and its displayed image on a smartphone screen. Despite advances in camera and display technology, the displayed image often differs noticeably from the original scene in terms of color, brightness, and contrast. This is because most pipelines separate the high-dimensional capture-to-display process into two stages, calibrating the camera and display separately and then connecting them through low-dimensional color transforms, which leads to information bottlenecks and error accumulation.

To overcome this challenge, the authors propose Color Pass-Through, an end-to-end learned framework that operates directly on captured images. The key insight is to treat the camera and display as a coupled system rather than calibrating them in isolation. This coupling brings two practical advantages: end-to-end optimization carries the entire real-world scene through to the display, and the complete capture-to-display path allows efficient one-step calibration for each distinct observer.

The authors validate Color Pass-Through using both digital and human observers. Compared to representative baselines, their method achieves an average gain of 2.0 points on a 5-point user study and more than 2x improvement on quantitative metrics, demonstrating improved reproduction of the perceived color of the original scene. The results show that the proposed approach can effectively reduce the color discrepancy between the original scene and its displayed image, leading to a more accurate and faithful representation of the scene.

📅 Published on Jul 14

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2607.12746
• PDF: https://arxiv.org/pdf/2607.12746
• Project Page: https://lyricccco.github.io/color-pass-through/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ColorPassThrough #CameraDisplayCoupling #ColorDiscrepancyCorrection #DisplayColorCalibration #CaptureToDisplayProcessing

🔥 Native and Compact Structured Latents for 3D Generation

💡 This paper addresses the challenge of 3D generative modeling where existing representations struggle to capture complex topologies and detailed appearance of 3D assets. To overcome this, the authors introduce a new sparse voxel representation called O-Voxel, which encodes both geometry and appearance of 3D objects. O-Voxel can robustly model arbitrary topology, including open, non-manifold, and fully-enclosed surfaces, and captures comprehensive surface attributes. The authors design a Sparse Compression VAE based on O-Voxel, which provides a high spatial compression rate and a compact latent space. They train large-scale models with 4B parameters on diverse public 3D asset datasets and achieve highly efficient inference. The results show that the generated assets have significantly better geometry and material quality compared to existing models. The approach offers a significant advancement in 3D generative modeling by enabling high-quality generation with efficient inference and robust topology handling.
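
For readers unfamiliar with sparse voxels, the sketch below shows the general idea with a plain dictionary: only occupied cells are stored, each with its own attributes. This is a generic sparse grid, not the O-Voxel format, whose actual fields are not described in this summary; the `color` attribute and the hollow-cube shape are assumptions chosen for illustration.

```python
from typing import Dict, Tuple

Coord = Tuple[int, int, int]


def hollow_cube(n: int) -> Dict[Coord, dict]:
    """Store only the surface cells of an n x n x n cube, each with a color."""
    edge = (0, n - 1)
    return {
        (x, y, z): {"color": (255, 255, 255)}
        for x in range(n)
        for y in range(n)
        for z in range(n)
        if x in edge or y in edge or z in edge
    }


n = 8
voxels = hollow_cube(n)
dense_cells = n ** 3
sparse_cells = len(voxels)
print(f"{sparse_cells} of {dense_cells} cells stored")
```

Because surfaces occupy a thin shell of the volume, the stored fraction shrinks as resolution grows, which is the usual motivation for sparse voxel grids.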

📅 Published on Dec 16, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2512.14692
• PDF: https://arxiv.org/pdf/2512.14692
• Project Page: https://microsoft.github.io/TRELLIS.2/

🤖 Models citing this paper:
• https://huggingface.co/microsoft/TRELLIS.2-4B
• https://huggingface.co/mancub/TRELLIS.2-4B
• https://huggingface.co/Jinstudio/TRELLIS.2-4B

📊 Datasets citing this paper:
• https://huggingface.co/datasets/serpentine-b/t2

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/microsoft/TRELLIS.2
• https://huggingface.co/spaces/TencentARC/Pixal3D
• https://huggingface.co/spaces/broyang/3dai

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#3DGenerativeModeling #SparseVoxelRepresentation #CompactLatentSpace #3DAssetGeneration #GeometricDeepLearning
