AI & ML Papers

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.


🔥 SkillOpt: Executive Strategy for Self-Evolving Agent Skills

💡 The paper introduces SkillOpt, a systematic approach to optimize agent skills through a text-space optimizer. Currently, agent skills are either hand-crafted, generated in one shot, or evolved through self-revision, which often results in unreliable improvements. SkillOpt addresses this issue by training skills as external state of a frozen agent, similar to how deep learning optimizers work.

The method involves a separate optimizer model that takes scored rollouts and applies bounded edits to a single skill document, accepting edits only when they improve a held-out validation score. To ensure stability, SkillOpt uses a textual learning-rate budget, rejected-edit buffer, and epoch-wise slow updates, all of which add zero inference-time model calls at deployment.
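The accept-only-on-validation-improvement loop described above can be sketched as a toy hill climber. This is a minimal illustration under stated assumptions, not the paper's implementation: the skill is modelled as a list of lines, `propose_edit` and `score` are hypothetical stand-ins for the optimizer model and the held-out rollout scorer, and `max_changed_lines` plays the role of the textual learning-rate budget. Epoch-wise slow updates are not modelled.

```python
import random

def optimize_skill(skill, propose_edit, score, steps=20, max_changed_lines=1, seed=0):
    """Toy text-space optimizer: keep an edit only if the validation score improves."""
    rng = random.Random(seed)
    best, best_score = list(skill), score(skill)
    rejected = []  # rejected-edit buffer, passed to the proposer so it could avoid repeats
    for _ in range(steps):
        candidate = propose_edit(best, rejected, rng)
        changed = sum(a != b for a, b in zip(best, candidate)) + abs(len(best) - len(candidate))
        if changed > max_changed_lines:  # bounded edit: over-budget proposals are dropped
            rejected.append(candidate)
            continue
        cand_score = score(candidate)
        if cand_score > best_score:      # accept only on strict validation improvement
            best, best_score = candidate, cand_score
        else:
            rejected.append(candidate)
    return best, best_score, rejected

# Hypothetical toy task: a skill is "good" if it contains these rules.
RULES = ["check units", "cite sources", "show steps"]
POOL = RULES + ["be verbose", "use emojis"]

def score(skill):
    # Stand-in for a held-out validation score.
    return sum(rule in skill for rule in RULES)

def propose_edit(skill, rejected, rng):
    # Stand-in for the optimizer model: rewrite one random line (ignores the buffer).
    candidate = list(skill)
    candidate[rng.randrange(len(candidate))] = rng.choice(POOL)
    return candidate

start = ["be verbose", "use emojis", "be verbose"]
final, final_score, rejected = optimize_skill(start, propose_edit, score, steps=200)
print(final, final_score, len(rejected))
```

Because acceptance requires a strict improvement, the returned score can never be lower than the starting score, which is the stability property the post attributes to the method.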

The results show that SkillOpt outperforms existing methods across six benchmarks, seven target models, and three execution environments. It achieves the best or tied-best performance on all 52 evaluated cells, matching or beating every competitor, including human-written skills, one-shot LLM generation, and other skill optimization methods. Notably, SkillOpt improves average accuracy over the no-skill baseline by 23.5 points on GPT-5.5 in direct chat, 24.8 points inside the Codex agentic loop, and 19.1 points inside Claude Code.

Furthermore, transfer experiments demonstrate that optimized skill artifacts retain their value when moved across model scales, between different execution environments, and to nearby benchmarks without further optimization. Overall, SkillOpt provides a systematic and controllable approach to optimize agent skills, resulting in superior performance and reliable improvements.

📅 Published on May 22

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23904
• PDF: https://arxiv.org/pdf/2605.23904
• Project Page: https://microsoft.github.io/SkillOpt/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#SelfEvolvingAgents #AgentSkillOptimization #TextSpaceOptimization #DeepLearningForAgents #ArtificialIntelligenceOptimization


🔥 Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

💡 The paper introduces Lens, a compact 3.8 billion parameter text-to-image model that achieves superior performance with reduced training compute. The problem addressed is the high computational cost of training large text-to-image models, which can be a significant barrier to their adoption. To address this, the authors propose two key strategies. First, they maximize data information density per training batch by using a dataset of 800 million densely captioned image-text pairs, where each caption contains approximately 109 words on average, providing richer semantic supervision than conventional short captions. They also construct each batch from images with multiple resolutions and diverse aspect ratios, enlarging the effective visual coverage of each optimization step.

Second, they improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. The authors also apply reinforcement learning with taxonomy-driven prompts and structured reward rubrics to suppress artifacts and improve visual quality, and use a reasoner module with training-free system prompt search to better align user requests with the model.

The results show that Lens achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6 billion parameters, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The model generalizes to arbitrary aspect ratios and resolutions up to 1440×1440, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024×1024 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds. Overall, the paper demonstrates that Lens is an efficient and effective text-to-image model that can be trained with significantly fewer computational resources than existing models.
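The ratios implied by the figures above can be checked with a few lines of arithmetic. The numbers are taken directly from the summary; the variable names are illustrative only.

```python
full_seconds = 3.15       # 1024x1024 generation time reported on one H100
turbo_seconds = 0.84      # 4-step distilled turbo version
compute_fraction = 0.193  # Lens training compute relative to Z-Image

turbo_speedup = full_seconds / turbo_seconds
compute_reduction = 1.0 / compute_fraction

print(f"turbo speedup: {turbo_speedup:.2f}x")                              # 3.75x
print(f"training compute vs Z-Image: {compute_reduction:.1f}x less")       # 5.2x
```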

📅 Published on May 20

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.21573
• PDF: https://arxiv.org/pdf/2605.21573
• Project Page: https://huggingface.co/microsoft/Lens

🤖 Models citing this paper:
• https://huggingface.co/microsoft/Lens-Turbo
• https://huggingface.co/microsoft/Lens
• https://huggingface.co/microsoft/Lens-Base

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/multimodalart/lens

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#TextToImageModels #EfficientTrainingMethods #CompactNeuralNetworks #ImageTextPairs #FoundationalModeling


🔥 PhotoFlow: Agentic 3D Virtual Photography Missions

💡 The paper introduces PhotoFlow, a Director-Reviewer-Reflector agent that enables language-conditioned virtual photography in arbitrary 3D scenes. The problem addressed is to create an agent that can enter a 3D scene, infer a suitable shot based on scene information and language intent, and render a photograph without preselected camera pose or reference image. This task requires complex 3D spatial understanding and abstract aesthetic judgment, which are difficult to evaluate together.

The method proposed is a closed-loop camera search using the Director-Reviewer-Reflector agent. The Director builds a photographic blueprint and proposes candidate cameras, the Reviewer checks and critiques the proposals, and the Reflector converts failures into region memory and adjusts the search. The authors also introduce VPhotoBench, a benchmark of 47 open-license 3D scenes and 141 language-conditioned photography missions.
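The propose, review, reflect cycle can be sketched as a toy search over a single camera yaw angle. Everything here is a hypothetical stand-in: the real Director, Reviewer, and Reflector are language-model components working on rendered images, whereas this sketch uses a hidden target angle as the "quality" signal and a list of blocked intervals as region memory (with no wrap-around handling at 0/360 degrees).

```python
import random

def director(rng, blocked, n=4):
    """Propose candidate yaw angles (degrees), skipping regions marked as failed."""
    out = []
    while len(out) < n:
        yaw = rng.uniform(0.0, 360.0)
        if not any(lo <= yaw <= hi for lo, hi in blocked):
            out.append(yaw)
    return out

def reviewer(yaw, target=135.0):
    """Score a candidate (higher is better); stand-in for rendering plus critique."""
    diff = abs(yaw - target) % 360.0
    return -min(diff, 360.0 - diff)

def reflector(blocked, yaw, width=10.0):
    """Convert a failed proposal into region memory so later rounds avoid it."""
    blocked.append((yaw - width, yaw + width))

def search(rounds=6, seed=0):
    """Closed-loop search under a fixed round budget (six, as in the summary)."""
    rng = random.Random(seed)
    blocked, best = [], None
    for _ in range(rounds):
        scored = [(reviewer(y), y) for y in director(rng, blocked)]
        top = max(scored)
        if best is None or top > best:
            best = top
        reflector(blocked, min(scored)[1])  # remember the worst proposal of the round
    return best, blocked

best, blocked = search()
print(f"best yaw {best[1]:.1f} deg, score {best[0]:.1f}, blocked regions: {len(blocked)}")
```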

The results show that PhotoFlow achieves the strongest external quality-alignment composite and success rate among the compared methods, including one-shot prediction, single-chain reflection, anchor-bank selection, and random search, under a six-round rendering budget. The paper demonstrates that a language-model-centered spatial agent can produce strong photographs in a setting that challenges both 3D reasoning and aesthetic choice, making language-conditioned virtual photography in arbitrary 3D scenes an executable agent task.

📅 Published on May 22

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23771
• PDF: https://arxiv.org/pdf/2605.23771
• Project Page: https://visionary-laboratory.github.io/PhotoFlow/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VirtualPhotography #3DSceneUnderstanding #AgenticSystems #LanguageConditionedRendering #IntelligentCameraSystems


🔥 SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

💡 The paper introduces SCOPE, a method for simulating cross-game operations in playable environments for first-person shooter (FPS) games. The problem addressed is that existing interactive world models for FPS games struggle to handle high-frequency, overlapping control signals without disrupting unaffected regions. This is because they inject actions globally and are trained on single game titles, an approach that fails under dense FPS inputs.

The proposed method conditions transformer blocks in video diffusion models to separate in-scope from out-of-scope visual effects without requiring segmentation labels. This is achieved by inserting a conditioning module into each transformer block of a pre-trained video diffusion model, which reshapes features into per-pixel temporal sequences. This allows each position to compute its action response from local visual content, effectively separating in-scope effects from out-of-scope generation.
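The reshape into per-pixel temporal sequences can be illustrated in plain Python. This is a sketch of the data layout only, assuming features indexed as `[t][h][w]`; the actual model would do this as a tensor reshape inside a deep learning framework, and the conditioning module that consumes the sequences is not shown.

```python
def to_pixel_sequences(video):
    """Rearrange features indexed [t][h][w] into one temporal sequence per pixel.

    Returns a list of length H*W; entry h*W + w holds the T feature
    vectors observed at pixel (h, w), in time order.
    """
    T, H, W = len(video), len(video[0]), len(video[0][0])
    return [[video[t][h][w] for t in range(T)] for h in range(H) for w in range(W)]

# Tiny example: 3 frames of a 2x2 feature map; each feature records its own index.
T, H, W = 3, 2, 2
video = [[[[t, h, w] for w in range(W)] for h in range(H)] for t in range(T)]

seqs = to_pixel_sequences(video)
print(len(seqs))  # 4 sequences, one per pixel
print(seqs[1])    # pixel (h=0, w=1) across time: [[0, 0, 1], [1, 0, 1], [2, 0, 1]]
```

With this layout, any per-sequence operation sees only one spatial position at a time, which is what lets each position respond to an action based on its own local content.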

The authors also introduce CrossFPS, a multi-game FPS dataset with frame-aligned action telemetry, comprising 69K clips from 7 titles with 10-degree-of-freedom controller signals. This dataset is curated to remove gameplay bias, allowing the model to learn general visual-to-action mappings rather than game-specific patterns.

The results show that SCOPE enables strong action responsiveness, precise scope separation, and effective cross-game generalization. The model learns general visual-to-action mappings, which enables zero-shot transfer to unseen scenes. This means that the model can be applied to new games without requiring additional training data, making it a significant contribution to the field of interactive world models for FPS games.

📅 Published on May 22

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23345
• PDF: https://arxiv.org/pdf/2605.23345
• Project Page: https://z2tong.github.io/SCOPE/

🤖 Models citing this paper:
• https://huggingface.co/zizhaotong/SCOPE

📊 Datasets citing this paper:
• https://huggingface.co/datasets/zizhaotong/CrossFPS-train
• https://huggingface.co/datasets/zizhaotong/CrossFPS-val

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#FirstPersonShooterGames #CrossGameOperations #PlayableEnvironments #VideoDiffusionModels #TransformerBlocks


🔥 More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

💡 This paper examines reasoning in Vision-Language Models and identifies a dual nature of multimodal reasoning. While reasoning enhances logical inference and improves performance on complex tasks, it can also impair perceptual grounding, leading to recognition failures on basic visual questions. The authors attribute this to visual forgetting, where prolonged reasoning causes the model to disregard visual input. To address it, they propose Vision-Anchored Policy Optimization (VAPO), a method that steers the reasoning process toward visually grounded trajectories. The resulting model, VAPO-Thinker-7B, relies significantly more on visual information and achieves state-of-the-art results on a range of benchmarks. The key contributions are the identification of this dual nature and a method that balances reasoning with visual grounding, improving performance on visual tasks.

📅 Published on Sep 30, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2509.25848
• PDF: https://arxiv.org/pdf/2509.25848
• Project Page: https://xytian1008.github.io/VAPO/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #MultimodalReasoning #VisualForgetting #VisionAnchoredPolicyOptimization #PerceptualGrounding


🔥 TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

💡 The paper presents TriSplat, a feed-forward 3D reconstruction network that generates simulation-ready meshes from single images. The problem addressed is that existing methods for 3D reconstruction require expensive post-processing steps to extract a usable mesh for simulation or physics reasoning. Most existing methods use Gaussian primitives and do not directly expose surfaces, making it difficult to obtain a simulation-ready mesh.

The method proposed in the paper uses oriented triangle primitives to represent scenes and directly exports simulation-ready mesh scenes from a single forward pass. The network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics from input images. The approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization.
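One common way to build geometry normals from a point map is to take the cross product of finite differences between neighbouring points. The sketch below assumes that construction, since the summary does not spell out the paper's exact formula; the image-conditioned refinement head and the conversion to local triangle frames are not shown.

```python
import math

def normals_from_point_map(points):
    """points[r][c] is an (x, y, z) tuple; returns unit normals on an (H-1) x (W-1) grid."""
    def sub(a, b):
        return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

    def cross(a, b):
        return (a[1] * b[2] - a[2] * b[1],
                a[2] * b[0] - a[0] * b[2],
                a[0] * b[1] - a[1] * b[0])

    H, W = len(points), len(points[0])
    normals = []
    for r in range(H - 1):
        row = []
        for c in range(W - 1):
            du = sub(points[r][c + 1], points[r][c])  # step along the column axis
            dv = sub(points[r + 1][c], points[r][c])  # step along the row axis
            n = cross(du, dv)
            length = math.sqrt(sum(x * x for x in n)) or 1.0  # guard degenerate patches
            row.append(tuple(x / length for x in n))
        normals.append(row)
    return normals

# A flat 3x3 patch in the z = 0 plane should yield normals along +z.
plane = [[(float(c), float(r), 0.0) for c in range(3)] for r in range(3)]
normals = normals_from_point_map(plane)
print(normals[0][0])  # (0.0, 0.0, 1.0)
```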

The results show that the proposed representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. The output of the network can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction. The experiments were conducted on RealEstate10K and DL3DV datasets and demonstrate the effectiveness of the proposed approach. Overall, the paper contributes a novel method for 3D scene reconstruction that bypasses expensive post-processing steps and directly generates simulation-ready meshes from single images.

📅 Published on May 25

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.26115
• PDF: https://arxiv.org/pdf/2605.26115
• Project Page: https://lhmd.top/trisplat/#interactive

🤖 Models citing this paper:
• https://huggingface.co/lhmd/TriSplat

📊 Datasets citing this paper:
• https://huggingface.co/datasets/lhmd/re10k_torch

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#3DSceneReconstruction #SimulationReadyMeshes #FeedForwardNetworks #TrianglePrimitives #ComputerVision


🔥 WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

💡 The paper introduces WBench, a comprehensive benchmark for evaluating interactive world models. The problem addressed is that existing benchmarks for interactive world models are limited and do not provide a unified standard for evaluation. To fill this gap, the authors created WBench, which evaluates models across five dimensions: video quality, setting adherence, interaction adherence, consistency, and physics compliance.

The method used to create WBench involves 289 test cases and 1058 interaction turns, covering diverse scenarios and interaction types, including navigation, subject action, event editing, and perspective switching. The benchmark unifies different input interfaces, such as text, 6-DoF pose, and discrete-action control, allowing for the evaluation of models with different native input interfaces. The evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments.

The results show that no single model performs strongly across all dimensions. The authors evaluated 20 state-of-the-art models using WBench and found that each model has characteristic strengths, weaknesses, and open challenges. The paper provides detailed diagnostic insights into the performance of each model, highlighting areas for improvement. The code and data for WBench are made available, allowing other researchers to use the benchmark to evaluate and improve their own interactive world models. Overall, the paper contributes to the development of interactive world models by providing a comprehensive and unified benchmark for evaluation.

📅 Published on May 25

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.25874
• PDF: https://arxiv.org/pdf/2605.25874
• Project Page: https://meituan-longcat.github.io/WBench/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#WorldModelEvaluation #InteractiveVideoBenchmarking #MultiturnDialogueSystems #VideoQualityAssessment #ArtificialIntelligenceForVideoAnalysis


🔥 ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

💡 The paper introduces ParaVT, a multi-agent reinforcement learning framework for parallel video tool calling, which enables the use of multiple video-processing tools simultaneously. This approach addresses the limitations of existing sequential methods, where a single incorrect tool call can propagate errors and corrupt context. The authors identify a key challenge in applying standard reinforcement learning to ParaVT, known as the Tool Prior Paradox, where pretrained tool priors enable tool exploration but also destabilize the model's structural format and create a shortcut for skipping tools.

To address this issue, the authors propose PARA-GRPO, a modified reinforcement learning algorithm that incorporates two complementary mechanisms: a targeted format reward and a per-prompt frame-budget randomization. The targeted format reward helps to stabilize the model's structural format, while the frame-budget randomization encourages the model to use tools in a way that yields a measurable reward signal.

The authors evaluate ParaVT with PARA-GRPO on six long-video understanding benchmarks and achieve an average improvement of 7.9% over the baseline Qwen3-VL model. Additionally, PARA-GRPO lifts training-time format compliance from 0.13 to 0.64, demonstrating the effectiveness of the proposed approach. The paper's contributions include a new framework for parallel video tool calling, a modified reinforcement learning algorithm, and a set of experimental results that demonstrate the benefits of the proposed approach. Overall, the paper provides a general recipe for agentic reinforcement learning that can be applied to a wide range of applications where tool capabilities are internalized in large multimodal models.

📅 Published on May 19

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.20342
• PDF: https://arxiv.org/pdf/2605.20342
• Project Page: https://evolvinglmms-lab.github.io/ParaVT/

🤖 Models citing this paper:
• https://huggingface.co/ParaVT/ParaVT-8B

📊 Datasets citing this paper:
• https://huggingface.co/datasets/ParaVT/ParaVT-Source
• https://huggingface.co/datasets/ParaVT/ParaVT-Parquet

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/ParaVT/ParaVT

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticVideoReinforcementLearning #ParallelToolUse #MultiAgentReinforcementLearning #VideoToolCalling #ToolPriorParadox


🔥 MiniCPM4: Ultra-Efficient LLMs on End Devices

💡 The paper introduces MiniCPM4, a highly efficient large language model designed for end-side devices. The goal is to achieve superior performance while being efficient, which is a challenge for large language models due to their computational requirements. To address this, the authors propose innovations in four key areas: model architecture, training data, training algorithms, and inference systems.

In terms of model architecture, the authors propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. For training data, they propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens.

The authors also propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and a data-efficient ternary LLM, BitCPM. For inference systems, they propose CPM.cu, which integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding.

The MiniCPM4 model is available in two versions, with 0.5B and 8B parameters, respectively. The evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences.

The results also show that MiniCPM4 can be adapted to power diverse applications, including trustworthy survey generation and tool use with model context protocol, clearly showcasing its broad usability. Overall, the paper presents a highly efficient large language model that achieves superior performance on end-side devices, making it a significant contribution to the field of natural language processing.

📅 Published on Jun 9, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2506.07900
• PDF: https://arxiv.org/pdf/2506.07900
• Project Page: https://huggingface.co/collections/openbmb/minicpm4-6841ab29d180257e940baa9b

🤖 Models citing this paper:
• https://huggingface.co/openbmb/MiniCPM4.1-8B
• https://huggingface.co/openbmb/MiniCPM5-1B
• https://huggingface.co/openbmb/MiniCPM4-8B

📊 Datasets citing this paper:
• https://huggingface.co/datasets/openbmb/Ultra-FineWeb

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/openbmb/MiniCPM5-1B-Demo
• https://huggingface.co/spaces/openbmb/Ultra-FineWeb-L2-Selector
• https://huggingface.co/spaces/openbmb/MiniCPM4.1-8B-Demo

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#EfficientLLMs #LargeLanguageModels #SparseAttentionMechanisms #EndDeviceComputing #LowResourceNLP


🔥 Toward Native Multimodal Modeling: A Roadmap

💡 The paper presents a roadmap for native multimodal modeling, which integrates different modalities within a unified transformer framework, enabling seamless understanding and generation across diverse input-output configurations. Traditional approaches rely on late-fusion, where encoders and language backbones are assembled with output heads, but recent efforts have shifted towards native multimodal modeling for superior performance. However, the design space of native architectures remains poorly defined.

To address this, the authors formally define architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. They categorize existing native models into three categories: Multi-to-Text for cross-modal comprehension with text-only output, Multi-to-Target for scenario-oriented generation such as image, audio, and video generation, and Multi-to-Multi for unified modeling with symmetric input-output.

The authors provide a comprehensive investigation into the transition towards a definitive native multimodal modeling framework, where understanding and generation coexist within a unified transformer paradigm. They systematically examine the end-to-end pipeline, including architectural coordination, massive data curation, full-stack training recipes, inference and deployment, and comprehensive evaluation for truly native modeling.

The paper's contributions include a formalized roadmap for native multimodal modeling, a categorization of existing native models, and a comprehensive investigation into the transition towards a unified transformer framework. The results provide a foundation for the development of native multimodal models that can seamlessly understand and generate across diverse input-output configurations, representing a significant step towards world modeling and modality-agnostic reasoning.

📅 Published on May 25

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.25343
• PDF: https://arxiv.org/pdf/2605.25343
• Project Page: https://nmm-roadmap.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#NativeMultimodalModeling #MultimodalTransformerArchitectures #EarlyFusionTechniques #MidFusionApproaches #UnifiedTransformerFrameworks
