AI & ML Papers

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

🔥 stable-worldmodel-v1: Reproducible World Modeling Research and Evaluation

💡 The paper introduces stable-worldmodel, a modular and standardized research framework for developing and evaluating world models. World models are a powerful tool for learning compact representations of environment dynamics, enabling agents to reason and generalize beyond direct experience. However, current implementations are often publication-specific, which limits their reusability, increases the risk of bugs, and reduces evaluation standardization.

To address this issue, the authors developed stable-worldmodel, a tested and documented research ecosystem that provides efficient data collection tools, standardized environments, planning algorithms, and baseline implementations. The framework allows for controllable environmental factors, including visual and physical properties, to support robustness and continual learning research.

The authors demonstrate the framework's utility by using it to study zero-shot robustness in DINO-WM. The paper's main contributions are a modular, standardized research framework for world models, efficient data collection tools with standardized environments, and this zero-shot robustness study.

Overall, the paper aims to provide a reliable, reproducible framework for world modeling, so that researchers can focus on developing and evaluating new world models rather than re-implementing and debugging existing ones. The authors hope this standardization will lead to more robust and generalizable world models.

📅 Published on Feb 9

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2602.08968
• PDF: https://arxiv.org/pdf/2602.08968
• Project Page: https://galilai-group.github.io/stable-worldmodel/

🤖 Models citing this paper:
• https://huggingface.co/zzsi/swm-dmc-cheetah
• https://huggingface.co/zzsi/swm-dmc-expert-policies

📊 Datasets citing this paper:
• https://huggingface.co/datasets/zzsi/swm-dmc-expert
• https://huggingface.co/datasets/zzsi/swm-dmc-mixed-small
• https://huggingface.co/datasets/zzsi/swm-dmc-mixed-large

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#WorldModeling #ReinforcementLearning #ArtificialIntelligence #RoboticsResearch #EnvironmentModeling

Forwarded from Data Analytics

📰 Anthropic is rolling out Claude Opus 4.8 🚀

The model has become significantly more honest in evaluating its own work and notices problems in its own code four times more often. 🔍✨

Plus, dynamic workflows have appeared — hundreds of AI subagents can work on large projects and migrations in parallel. 🤖⚡

⛓️ More details here
https://www.anthropic.com/news/claude-opus-4-8

#Anthropic #ClaudeOpus48 #AI #ArtificialIntelligence #TechNews #Innovation

🚀 HelloEncyclo Presale is LIVE!

Master the skills that matter — Gen-AI, Data Science, Machine Learning and more — all in one place.

🎁 First 250 members get a flat 40% OFF

Use code: PRESALE-BOOK-WAVE-2GFG

✅ 13 full courses live right now

✅ 40+ more dropping in the next 2–3 weeks

✅ Complete library within 2 months — built and refined by industry experts

✅ 15-day money-back guarantee — don't love it? Get a full refund.

⚠️ Coupon works only after you log in with Gmail, and it's valid once per member.

👉 Log in now and start learning:

https://helloencyclo.com

Don't wait — the 40% deal disappears after the first 250 seats. 🔥

🔥 minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models

💡 The paper presents a comprehensive framework called minWM for converting bidirectional video diffusion models into real-time interactive video world models. The problem addressed is that recent video diffusion foundation models have achieved high-quality video generation but turning them into real-time interactive world models remains challenging due to the need for controllable, causal, and low-latency capabilities.

minWM provides an end-to-end pipeline that converts existing bidirectional video foundation models into camera-controllable, few-step autoregressive world models. The conversion relies on fine-tuning and distillation techniques, including causal forcing, causal consistency distillation, and asymmetric DMD. The framework is modular and architecture-extensible, so it can be instantiated on different open backbones and adapted to new data distributions, training recipes, and latency targets.

The result is a real-time interactive video world model with camera control, low-latency rollout, and high-quality video generation. The framework is released with runnable scripts, checkpoints, documentation, and inference code, along with practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements. Overall, minWM offers a reproducible and extensible recipe for building and adapting real-time interactive video world models.

📅 Published on May 28

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30263
• PDF: https://arxiv.org/pdf/2605.30263

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VideoDiffusionModels #RealTimeInteractiveSystems #VideoWorldModels #BidirectionalVideoGeneration #InteractiveVideoFrameworks

🔥 LoMo: Local Modality Substitution for Deeper Vision-Language Fusion

💡 The paper addresses modality sensitivity in vision-language models: performance degrades significantly when the modality of the input changes, for example when a textual question is replaced with its rendered-image counterpart. The authors attribute this to a bias in current training corpora, where text and images are typically organized into distinct and asymmetric roles.

To address this, the authors propose Local Modality Substitution (LoMo), a data curation approach that supervises cross-modal representational invariance between semantically equivalent text and image carriers. It reformulates single-modality prompts into interleaved multimodal sequences by dynamically selecting target text spans and recasting them as rendered images, preserving the same semantics across carriers.

Evaluated on 13 diverse multimodal benchmarks, the approach improves overall multimodal reasoning and yields deeper cross-modal fusion, with consistent gains across foundation models: 2.67 points on one model and 2.82 points on another, compared to standard methods. The method is lightweight and architecture-agnostic.
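
The span-substitution idea can be pictured with a minimal sketch. It is not the authors' pipeline: the span is passed in by hand rather than selected dynamically, and `render_placeholder` is a hypothetical stand-in for a real text-to-image renderer. It only shows how one prompt becomes an interleaved text/image sequence that still carries the same semantics.

```python
def render_placeholder(text):
    # Hypothetical stand-in for a real renderer that would rasterize the text.
    return f"<image of: {text!r}>"


def substitute_span(prompt, start, end, render=render_placeholder):
    """Recast prompt[start:end] as an image segment and keep the rest as text."""
    if not 0 <= start < end <= len(prompt):
        raise ValueError("span must be non-empty and lie inside the prompt")
    span = prompt[start:end]
    segments = []
    if prompt[:start]:
        segments.append({"type": "text", "content": prompt[:start]})
    # The image segment keeps its source text so equivalence can be checked.
    segments.append({"type": "image", "content": render(span), "source_text": span})
    if prompt[end:]:
        segments.append({"type": "text", "content": prompt[end:]})
    return segments


def recover_text(segments):
    """Rebuild the original prompt from an interleaved sequence."""
    return "".join(
        s["source_text"] if s["type"] == "image" else s["content"] for s in segments
    )
```

Calling `substitute_span(prompt, start, end)` on a question and then `recover_text` on the result should give back the original string, which is the invariance the training signal is meant to enforce.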

📅 Published on May 28

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30265
• PDF: https://arxiv.org/pdf/2605.30265
• Project Page: https://maplebb.github.io/LoMo/page/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #ModalitySubstitution #CrossModalLearning #MultimodalFusion #DeepLearningArchitectures

🔥 MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

💡 The paper introduces MobileGym, a browser-based mobile environment designed for mobile GUI agent research. The main problem addressed is the lack of a verifiable and highly parallel simulation platform for training and evaluating mobile GUI agents. Traditional methods are limited by their inability to provide deterministic outcome signals and scalable reinforcement learning.

The authors propose MobileGym as a solution, which enables deterministic evaluation and scalable reinforcement learning through JSON-based state management and parallel execution. The platform captures the full environment state as structured JSON, allowing for easy configuration, forking, and comparison of states. This approach enables a single server to host hundreds of parallel instances, with low memory requirements and fast startup times.
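
A rough sketch of why a JSON-captured state makes forking, comparison, and judging cheap. The state layout, the path syntax, and every function name here are illustrative assumptions, not MobileGym's actual API.

```python
import json


def fork_state(state):
    """Return an independent copy of a JSON-serializable environment state."""
    return json.loads(json.dumps(state))


def diff_states(before, after, path=""):
    """List (path, old, new) for every leaf that differs between two states."""
    if isinstance(before, dict) and isinstance(after, dict):
        changes = []
        for key in sorted(set(before) | set(after)):
            changes += diff_states(before.get(key), after.get(key), f"{path}/{key}")
        return changes
    return [] if before == after else [(path, before, after)]


def lookup(state, path):
    """Follow a '/'-separated path into nested dicts; None if it is missing."""
    node = state
    for key in path.strip("/").split("/"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def judge(state, expected):
    """Deterministic verdict plus a dense reward (fraction of conditions met)."""
    hits = sum(1 for path, want in expected.items() if lookup(state, path) == want)
    return hits == len(expected), hits / len(expected)
```

Because the whole state is plain data, a fork is one serialize/deserialize round trip, and the same `expected` mapping yields both a pass/fail verdict and a partial-credit reward.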

MobileGym features a layered state model and a declarative task-definition framework, making it practical to create and program tasks at scale. The platform also includes a single programmatic judging mechanism that delivers both deterministic evaluation verdicts and dense RL rewards. To facilitate research, the authors provide MobileGym-Bench, a collection of 416 parameterized task templates across 28 apps, including 256 test and 160 train templates.

The results demonstrate the effectiveness of MobileGym in a Sim-to-Real case study, where a model trained in the simulation environment achieves a 12.8 percentage point gain on a 256-task test set. When executed on real devices, the model retains 95.1% of the simulation-side training gain, indicating the potential of MobileGym for real-world applications. Overall, MobileGym provides a verifiable and highly parallel simulation platform for mobile GUI agent research, enabling scalable reinforcement learning and deterministic evaluation.

📅 Published on May 25

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.26114
• PDF: https://arxiv.org/pdf/2605.26114
• Project Page: https://mobilegym.github.io

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MobileGUIAgents #ParallelSimulation #ReinforcementLearning #MobileEnvironmentSimulation #GUIAgentResearch

🔥 GenClaw: Code-Driven Agentic Image Generation

💡 The paper introduces GenClaw, a code-driven agentic image generation framework that enables precise visual construction through a staged process. The problem with existing image generation models is that they are black-box systems that rely on text-conditioned pixel synthesis, leaving them with no direct mechanism to manipulate the canvas. This leads to a repetitive cycle of prompt rewriting for generation refinement, limiting their potential for precise visual construction.

The GenClaw method addresses this issue by empowering the agent to create like a human artist, through three stages: conceptualization, sketching, and coloring. In the conceptualization stage, the agent constructs conceptual knowledge and context through search and reasoning. The agent then utilizes code, such as SVG or HTML, to render executable visual sketches in the sketching stage. Finally, it employs an image generation model to supplement textures, materials, and photorealism in the coloring stage.
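
To make the sketching stage concrete, here is a minimal illustration of code as an editable canvas. The layout format and `render_sketch` are assumptions for this example, not GenClaw's interface, and the coloring stage (handing the sketch to an image model) is left out.

```python
from xml.etree import ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def render_sketch(shapes, width=256, height=256):
    """Serialize a list of shape dicts into an SVG document string."""
    root = ET.Element(
        "svg", {"xmlns": SVG_NS, "width": str(width), "height": str(height)}
    )
    for shape in shapes:
        attrs = {k: str(v) for k, v in shape.items() if k != "tag"}
        ET.SubElement(root, shape["tag"], attrs)
    return ET.tostring(root, encoding="unicode")


# Hypothetical output of the conceptualization stage: a meadow with a sun.
layout = [
    {"tag": "rect", "x": 0, "y": 160, "width": 256, "height": 96, "fill": "green"},
    {"tag": "circle", "cx": 200, "cy": 50, "r": 24, "fill": "gold"},
]
sketch = render_sketch(layout)

# Direct canvas manipulation: move the sun to the left without re-prompting.
layout[1]["cx"] = 60
revised = render_sketch(layout)
```

The point of the intermediate code is the last two lines: a composition change is a one-attribute edit rather than another round of prompt rewriting.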

By using code as a controllable intermediate canvas, GenClaw bridges linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. This approach transforms image generation from a black-box paradigm into a staged process, offering a step toward highly controllable and interpretable visual generation systems. The results of GenClaw demonstrate a more precise and interpretable image generation process, allowing for direct manipulation of the canvas and overcoming the limitations of existing black-box image models.

📅 Published on May 28

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30248
• PDF: https://arxiv.org/pdf/2605.30248

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticImageGeneration #CodeDrivenArt #StagedImageConstruction #VisualConstructionTechniques #ImageGenerationFrameworks

🔥 VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

💡 The paper introduces VibeSearchBench, a benchmark for evaluating long-horizon proactive search in real-world scenarios. The problem addressed is the poor performance of large language model-based agents in search tasks that involve multi-turn dialogue and collaborative refinement of user intent. Existing benchmarks rely on over-specified queries, single-turn interactions, and fixed-schema evaluation, which do not reflect real search behavior.

To address this issue, the authors propose VibeSearch, a paradigm that involves multi-turn dialogue and collaborative refinement of vague user intent. The VibeSearchBench benchmark consists of 200 manually curated bilingual tasks across 20 domains, split into professional and daily-life subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework.
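
For intuition on how a knowledge graph can be scored with F1, here is a simplified sketch that treats each graph as a set of (subject, relation, object) triples and counts exact matches. The benchmark's own graph-matching framework is schema-free and is presumably more tolerant than exact matching, so this is an assumption-laden toy, not the official metric.

```python
def triple_f1(predicted, gold):
    """F1 between two knowledge graphs given as iterables of triples."""
    pred, ref = set(predicted), set(gold)
    matched = len(pred & ref)
    if matched == 0:
        return 0.0
    precision = matched / len(pred)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)
```

With a three-triple gold graph and a two-triple prediction that gets one triple right, precision is 1/2 and recall is 1/3, which shows how quickly the score drops when an agent misses part of the user's intent.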

The authors benchmark seven frontier models under two different frameworks and find that all models perform poorly, with the best F1 score being 30.30. This highlights the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction. The paper's contributions include the introduction of the VibeSearch paradigm, the creation of the VibeSearchBench benchmark, and the evaluation of state-of-the-art models in this new benchmark, which reveals the significant gap between current models and real-world search requirements.

📅 Published on May 27

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.27882
• PDF: https://arxiv.org/pdf/2605.27882
• Project Page: https://vibebench.github.io/VibeSearchBench.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ProactiveSearch #LongHorizonSearch #MultiTurnDialogue #CollaborativeSearch #NaturalLanguageSearch

🔥 SkillNet: Create, Evaluate, and Connect AI Skills

💡 The paper introduces SkillNet, an open infrastructure designed to systematically accumulate and transfer artificial intelligence skills across multiple domains. The problem addressed is that current AI agents lack a unified mechanism for skill consolidation, resulting in redundant efforts and limited long-term advancement. To overcome this limitation, SkillNet structures skills within a unified ontology that supports creating skills from diverse sources, establishing connections, and evaluating skills across multiple dimensions such as safety, completeness, and cost awareness.

The SkillNet infrastructure consists of a repository of over 200,000 skills, an interactive platform, and a Python toolkit. This infrastructure enables the creation, evaluation, and organization of AI skills at scale. By formalizing skills as evolving and composable assets, SkillNet provides a robust foundation for agents to move from transient experience to durable mastery.

The results of the paper demonstrate the effectiveness of SkillNet in enhancing agent performance. Experimental evaluations on various environments such as ALFWorld, WebShop, and ScienceWorld show that SkillNet significantly improves average rewards by 40 percent and reduces execution steps by 30 percent across multiple backbone models. Overall, the paper contributes to the development of a unified infrastructure for AI skill accumulation and transfer, which has the potential to accelerate the advancement of AI agents across multiple domains.

📅 Published on Feb 26

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2603.04448
• PDF: https://arxiv.org/pdf/2603.04448
• Project Page: http://skillnet.openkg.cn/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ArtificialIntelligenceSkills #AIInfrastructureDevelopment #SkillOntology #ArtificialGeneralIntelligence #TransferLearningMechanisms

🔥 OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

💡 The paper proposes a new method called OSCAR for ultra-low-bit key-value cache quantization, which is crucial for efficient deployment of large language models. The problem addressed is that existing quantization methods, such as simple rotations like Hadamard transforms, degrade in accuracy when applied to very low-bit representations, like 2-bit integers. This degradation occurs because these methods do not account for the attention-aware covariance structures that the model actually uses.

To solve this problem, OSCAR estimates the attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. This approach aligns the key-value cache quantization with the covariance structures that the model consumes, leading to higher accuracy and efficiency.
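
The general rotate-then-quantize pattern can be sketched in two dimensions. This toy is not OSCAR itself: it uses the plain sample covariance of a calibration set instead of the attention-aware covariance described in the paper, a single rotation angle instead of a full rotation matrix, and a hand-picked clipping threshold. All function names are made up for the example.

```python
import math


def principal_angle(samples):
    """Angle of the principal axis of 2-D calibration samples (offline step)."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    cxx = sum((x - mx) ** 2 for x, _ in samples) / n
    cyy = sum((y - my) ** 2 for _, y in samples) / n
    cxy = sum((x - mx) * (y - my) for x, y in samples) / n
    return 0.5 * math.atan2(2 * cxy, cxx - cyy)


def rotate(vec, theta):
    """Apply a 2-D rotation, which is an orthogonal (norm-preserving) map."""
    c, s = math.cos(theta), math.sin(theta)
    x, y = vec
    return (c * x - s * y, s * x + c * y)


def quantize_2bit(value, clip):
    """Clip to [-clip, clip] and map to one of four integer codes 0..3."""
    step = 2 * clip / 3  # four levels cover the range in three steps
    clipped = max(-clip, min(clip, value))
    return int(round((clipped + clip) / step))


def dequantize_2bit(code, clip):
    """Map a 2-bit code back to its representative value."""
    return code * (2 * clip / 3) - clip


def roundtrip(vec, theta, clip):
    """Rotate into the calibrated basis, quantize, then rotate back."""
    rotated = rotate(vec, -theta)
    codes = [quantize_2bit(v, clip) for v in rotated]
    restored = rotate([dequantize_2bit(c, clip) for c in codes], theta)
    return codes, restored
```

The rotation and the threshold are fixed before serving, so the online cost is one orthogonal transform plus a clip-and-round per cached vector; only the 2-bit codes need to be stored.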

The authors provide theoretical justification for OSCAR and develop a fully deployable system that is compatible with modern large language model serving frameworks. They evaluate OSCAR on several reasoning models with long context lengths, up to 32,000 tokens, and report significant accuracy improvements over naive rotation methods. Specifically, OSCAR narrows the accuracy gap to 3.78 and 1.42 points on two models, whereas naive rotation methods collapse to near-zero accuracy.

The results also show that OSCAR scales well to larger models, remaining effectively on par with higher-precision representations. Additionally, OSCAR achieves significant system-wise improvements, including reducing key-value cache memory by approximately 8 times, improving throughput by up to 7 times, and accelerating batch-size-1 decoding by up to 3 times over higher-precision representations. Overall, the paper demonstrates that OSCAR is an effective and efficient method for ultra-low-bit key-value cache quantization, enabling the deployment of large language models with high accuracy and efficiency.

📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.17757
• PDF: https://arxiv.org/pdf/2605.17757
• Project Page: https://oscar-quantize.github.io/

🤖 Models citing this paper:
• https://huggingface.co/Zhongzhu/OSCAR-RotationZoo

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#QuantizationMethods #LowBitRepresentations #KeyvalueCache #SpectralCovariance #EfficientDeployment
