AI & ML Papers
🔥 GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
📅 Published on Jul 25, 2025
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2507.19457
• PDF: https://arxiv.org/pdf/2507.19457
• Project Page: https://gepa-ai.github.io/gepa/
🤖 Models citing this paper:
• https://huggingface.co/pirola/local-ai-coding-stack-research
📊 Datasets citing this paper:
• https://huggingface.co/datasets/zhongweixie/A-Survey-on-AI-Agent-Harness
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#NaturalLanguageReflection #PromptOptimization #ReinforcementLearningAlternatives #GeneticParetoOptimization #LanguageModelLearning
💡 The paper introduces GEPA (Genetic-Pareto), a prompt optimizer that uses natural-language reflection to learn high-level rules from trial and error. It targets a weakness of current reinforcement learning methods such as Group Relative Policy Optimization (GRPO), which often need thousands of rollouts to learn a new task. The authors argue that the interpretable nature of language gives large language models a richer learning medium than policy gradients derived from sparse scalar rewards.
GEPA samples system-level trajectories, reflects on them in natural language to diagnose problems, proposes and tests prompt updates, and combines complementary lessons from the Pareto frontier of its own attempts. This lets it turn even a few rollouts into a large quality gain.
The results show that GEPA outperforms GRPO by 10% on average and by up to 20%, while using up to 35 times fewer rollouts. It also outperforms the leading prompt optimizer, MIPROv2, by over 10% across two large language models, and shows promising results as an inference-time search strategy for code optimization.
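The idea of keeping complementary prompts alive can be illustrated with a small sketch. This is a simplified reading of Pareto-style candidate selection, not the authors' code: the function names `instance_wins` and `sample_parent` and the toy score matrix are assumptions made for illustration. Each candidate prompt is credited for every task instance on which it achieves the best score, and the next prompt to mutate is sampled in proportion to those wins.

```python
import random
from collections import Counter

def instance_wins(scores):
    """Count, per candidate, the task instances on which it ties for the best score.

    scores[c][i] is the score of candidate prompt c on task instance i.
    """
    wins = Counter()
    for i in range(len(scores[0])):
        best = max(row[i] for row in scores)
        for c, row in enumerate(scores):
            if row[i] == best:
                wins[c] += 1
    return wins

def sample_parent(scores, rng):
    """Sample the next prompt to mutate, weighted by its number of instance wins."""
    wins = instance_wins(scores)
    candidates = sorted(wins)
    weights = [wins[c] for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

# Toy score matrix (hypothetical numbers): 3 candidate prompts, 3 task instances.
scores = [
    [0.9, 0.2, 0.1],  # prompt 0: best on instance 0, ties on instance 2
    [0.3, 0.8, 0.1],  # prompt 1: best on instance 1, ties on instance 2
    [0.5, 0.5, 0.0],  # prompt 2: decent on average but never the best
]
wins = instance_wins(scores)
parent = sample_parent(scores, random.Random(0))
```

Prompt 2 is never the best on any instance, so it is never chosen as a parent, while prompts 0 and 1 both stay in the pool because each one carries a distinct lesson.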
🔥 SkillOpt: Executive Strategy for Self-Evolving Agent Skills
📅 Published on May 22
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23904
• PDF: https://arxiv.org/pdf/2605.23904
• Project Page: https://microsoft.github.io/SkillOpt/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#SelfEvolvingAgents #AgentSkillOptimization #TextSpaceOptimization #DeepLearningForAgents #ArtificialIntelligenceOptimization
💡 The paper introduces SkillOpt, a systematic approach to optimize agent skills through a text-space optimizer. Currently, agent skills are either hand-crafted, generated in one shot, or evolved through self-revision, which often results in unreliable improvements. SkillOpt addresses this issue by training skills as external state of a frozen agent, similar to how deep learning optimizers work.
The method involves a separate optimizer model that takes scored rollouts and applies bounded edits to a single skill document, accepting edits only when they improve a held-out validation score. To ensure stability, SkillOpt uses a textual learning-rate budget, rejected-edit buffer, and epoch-wise slow updates, all of which add zero inference-time model calls at deployment.
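The accept-or-reject logic described above can be sketched as follows. This is a minimal illustration under stated assumptions, not SkillOpt's implementation: the edit budget is measured here with `difflib` similarity, and `optimize_skill`, `toy_propose`, and `toy_validate` are hypothetical names. Epoch-wise slow updates are not modeled.

```python
import difflib

def edit_size(old, new):
    """Fraction of the text that changed, in [0, 1] (0 means identical)."""
    return 1.0 - difflib.SequenceMatcher(None, old, new).ratio()

def optimize_skill(skill, propose, validate, budget, steps):
    """Apply bounded edits, keeping one only if the held-out score improves."""
    best_score = validate(skill)
    rejected = []  # buffer of edits that were tried and discarded
    for _ in range(steps):
        candidate = propose(skill, rejected)
        if edit_size(skill, candidate) > budget:
            rejected.append(candidate)  # edit exceeds the textual budget
            continue
        score = validate(candidate)
        if score > best_score:
            skill, best_score = candidate, score
        else:
            rejected.append(candidate)  # no validation gain
    return skill, best_score, rejected

# Toy stand-ins (assumptions): a keyword-counting validator and scripted proposals.
REQUIRED = ("check units", "cite sources")

def toy_validate(text):
    return sum(keyword in text for keyword in REQUIRED)

proposals = iter([
    "Ignore everything above and write a long essay about unrelated topics instead.",
    "Answer briefly. Always check units.",
    "Answer briefly. Always check units!",
])

def toy_propose(skill, rejected):
    return next(proposals)

skill, score, rejected = optimize_skill(
    "Answer briefly.", toy_propose, toy_validate, budget=0.6, steps=3
)
```

The first proposal is discarded for rewriting too much of the document, the second is accepted because it raises the validation score, and the third is discarded because it changes the text without improving the score.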
The results show that SkillOpt outperforms existing methods across six benchmarks, seven target models, and three execution environments. It achieves the best or tied-best performance on all 52 evaluated cells, beating every competitor, including human-written skills, one-shot LLM-generated skills, and other skill optimization methods. Notably, SkillOpt improves average accuracy over the no-skill baseline by 23.5 points on GPT-5.5 in direct chat, 24.8 points inside the Codex agentic loop, and 19.1 points inside Claude Code.
Furthermore, transfer experiments demonstrate that optimized skill artifacts retain their value when moved across model scales, between different execution environments, and to nearby benchmarks without further optimization. Overall, SkillOpt provides a systematic and controllable approach to optimize agent skills, resulting in superior performance and reliable improvements.
🔥 Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models
📅 Published on May 20
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.21573
• PDF: https://arxiv.org/pdf/2605.21573
• Project Page: https://huggingface.co/microsoft/Lens
🤖 Models citing this paper:
• https://huggingface.co/microsoft/Lens-Turbo
• https://huggingface.co/microsoft/Lens
• https://huggingface.co/microsoft/Lens-Base
🚀 Spaces citing this paper:
• https://huggingface.co/spaces/multimodalart/lens
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#TextToImageModels #EfficientTrainingMethods #CompactNeuralNetworks #ImageTextPairs #FoundationalModeling
💡 The paper introduces Lens, a compact 3.8 billion parameter text-to-image model that achieves superior performance with reduced training compute. The problem addressed is the high computational cost of training large text-to-image models, which can be a significant barrier to their adoption. To address this, the authors propose two key strategies. First, they maximize data information density per training batch by using a dataset of 800 million densely captioned image-text pairs, where each caption contains approximately 109 words on average, providing richer semantic supervision than conventional short captions. They also construct each batch from images with multiple resolutions and diverse aspect ratios, enlarging the effective visual coverage of each optimization step.
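The summary does not say how Lens packs images of different shapes into a batch. One common technique for training on mixed aspect ratios is resolution bucketing, sketched below as an assumption rather than the authors' method. The helper names `make_buckets` and `nearest_bucket`, the target area, and the 64-pixel rounding are all illustrative choices.

```python
import math

def make_buckets(aspect_ratios, area=1024 * 1024, multiple=64):
    """Return (width, height) pairs of roughly `area` pixels, snapped to `multiple`."""
    buckets = []
    for ar in aspect_ratios:
        w = round(math.sqrt(area * ar) / multiple) * multiple
        h = round(math.sqrt(area / ar) / multiple) * multiple
        buckets.append((w, h))
    return buckets

def nearest_bucket(width, height, buckets):
    """Pick the bucket whose aspect ratio is closest to the image's, in log space."""
    target = math.log(width / height)
    return min(buckets, key=lambda b: abs(math.log(b[0] / b[1]) - target))

buckets = make_buckets([1.0, 4 / 3, 16 / 9, 3 / 4, 9 / 16])
chosen = nearest_bucket(1920, 1080, buckets)
```

Each image is resized to its nearest bucket, so a training step can draw from several buckets and still keep the pixel count per image roughly constant.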
Second, they improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. The authors also apply reinforcement learning with taxonomy-driven prompts and structured reward rubrics to suppress artifacts and improve visual quality, and use a reasoner module with training-free system prompt search to better align user requests with the model.
The results show that Lens achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6 billion parameters, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The model generalizes to arbitrary aspect ratios and resolutions up to 1440×1440, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024×1024 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds. Overall, the paper demonstrates that Lens is an efficient and effective text-to-image model that can be trained with significantly fewer computational resources than existing models.
🔥 PhotoFlow: Agentic 3D Virtual Photography Missions
📅 Published on May 22
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23771
• PDF: https://arxiv.org/pdf/2605.23771
• Project Page: https://visionary-laboratory.github.io/PhotoFlow/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VirtualPhotography #3DSceneUnderstanding #AgenticSystems #LanguageConditionedRendering #IntelligentCameraSystems
💡 The paper introduces PhotoFlow, a Director-Reviewer-Reflector agent that enables language-conditioned virtual photography in arbitrary 3D scenes. The problem addressed is to create an agent that can enter a 3D scene, infer a suitable shot based on scene information and language intent, and render a photograph without preselected camera pose or reference image. This task requires complex 3D spatial understanding and abstract aesthetic judgment, which are difficult to evaluate together.
The method proposed is a closed-loop camera search using the Director-Reviewer-Reflector agent. The Director builds a photographic blueprint and proposes candidate cameras, the Reviewer checks and critiques the proposals, and the Reflector converts failures into region memory and adjusts the search. The authors also introduce VPhotoBench, a benchmark of 47 open-license 3D scenes and 141 language-conditioned photography missions.
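The shape of such a closed loop can be sketched on a toy problem. Everything here is an assumption made for illustration: the camera is reduced to a single yaw angle, the Reviewer is a made-up scoring function, and the Reflector simply shrinks the search window and records the discarded regions. The real system reasons over 3D scenes and language intent.

```python
def reviewer(yaw, target=130.0):
    """Toy critique: higher is better, peaking when the camera faces `target`."""
    return -abs(yaw - target)

def director(lo, hi, n=5):
    """Propose n evenly spaced candidate yaw angles inside the search window."""
    step = (hi - lo) / (n - 1)
    return [lo + k * step for k in range(n)]

def reflector(cands, scores, lo, hi):
    """Shrink the window around the best candidate and report discarded regions."""
    best = max(range(len(cands)), key=scores.__getitem__)
    width = (hi - lo) / 4
    new_lo = max(lo, cands[best] - width)
    new_hi = min(hi, cands[best] + width)
    discarded = [(a, b) for a, b in ((lo, new_lo), (new_hi, hi)) if b > a]
    return new_lo, new_hi, discarded

def search(rounds=6):
    """Run a fixed budget of propose, review, reflect rounds."""
    lo, hi, memory = 0.0, 360.0, []
    best_yaw, best_score = None, float("-inf")
    for _ in range(rounds):
        cands = director(lo, hi)
        scores = [reviewer(c) for c in cands]
        for c, s in zip(cands, scores):
            if s > best_score:
                best_yaw, best_score = c, s
        lo, hi, discarded = reflector(cands, scores, lo, hi)
        memory.extend(discarded)  # region memory of where not to look again
    return best_yaw, best_score, memory

best_yaw, best_score, memory = search(rounds=6)
```

With a six-round budget the toy search ends within one degree of the target yaw, and `memory` holds the regions that were ruled out along the way.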
The results show that PhotoFlow achieves the strongest external quality-alignment composite and success rate among various methods, including one-shot prediction, single-chain reflection, anchor-bank selection, and random search, under a six-round rendering budget. The paper demonstrates that a language model-centered spatial agent can produce strong photographs in a setting that challenges both 3D reasoning and aesthetic choice, making language-conditioned virtual photography in arbitrary 3D scenes an executable agent task.
🔥 SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models
📅 Published on May 22
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23345
• PDF: https://arxiv.org/pdf/2605.23345
• Project Page: https://z2tong.github.io/SCOPE/
🤖 Models citing this paper:
• https://huggingface.co/zizhaotong/SCOPE
📊 Datasets citing this paper:
• https://huggingface.co/datasets/zizhaotong/CrossFPS-train
• https://huggingface.co/datasets/zizhaotong/CrossFPS-val
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#FirstPersonShooterGames #CrossGameOperations #PlayableEnvironments #VideoDiffusionModels #TransformerBlocks
💡 The paper introduces SCOPE, a method for simulating cross-game operations in playable environments for first-person shooter (FPS) games. Existing interactive world models for FPS games struggle to handle high-frequency, overlapping control signals without disrupting unaffected regions, because they inject actions globally and are trained on single game titles, which fails under dense FPS inputs.
The proposed method conditions transformer blocks in video diffusion models to separate in-scope from out-of-scope visual effects without requiring segmentation labels. It inserts a conditioning module into each transformer block of a pre-trained video diffusion model and reshapes features into per-pixel temporal sequences, so that each position computes its action response from local visual content.
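The reshaping step can be made concrete with NumPy. This is only a sketch: the tensor layout `(B, T, H, W, C)`, the sigmoid gate, and the names `to_pixel_sequences`, `from_pixel_sequences`, and `local_action_response` are assumptions, and the example uses a single action sequence shared by the whole batch for simplicity.

```python
import numpy as np

def to_pixel_sequences(x):
    """(B, T, H, W, C) -> (B*H*W, T, C): one temporal sequence per spatial position."""
    b, t, h, w, c = x.shape
    return x.transpose(0, 2, 3, 1, 4).reshape(b * h * w, t, c)

def from_pixel_sequences(seq, b, h, w):
    """Inverse of to_pixel_sequences: (B*H*W, T, C) -> (B, T, H, W, C)."""
    _, t, c = seq.shape
    return seq.reshape(b, h, w, t, c).transpose(0, 3, 1, 2, 4)

def local_action_response(seq, action, w_gate):
    """Add the action signal at each position, scaled by a gate computed locally.

    seq: (N, T, C) per-pixel sequences, action: (T, C), w_gate: (C,).
    """
    gate = 1.0 / (1.0 + np.exp(-(seq @ w_gate)))  # (N, T), from local features only
    return seq + gate[..., None] * action[None, :, :]

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 4, 3, 5, 8))  # B=1, T=4 frames, 3x5 positions, 8 channels
seq = to_pixel_sequences(x)
action = rng.normal(size=(4, 8))
w_gate = rng.normal(size=8)
out = from_pixel_sequences(local_action_response(seq, action, w_gate), 1, 3, 5)
```

Because the gate depends only on the features at one position, a position whose gate is near zero is left almost unchanged by the action, which is the intuition behind keeping out-of-scope regions stable.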
The authors also introduce CrossFPS, a multi-game FPS dataset with frame-aligned action telemetry, comprising 69K clips from 7 titles with 10-degree-of-freedom controller signals. The dataset is curated to remove gameplay bias, so the model learns general visual-to-action mappings rather than game-specific patterns.
The results show that SCOPE achieves strong action responsiveness, precise scope separation, and effective cross-game generalization. Because it learns general visual-to-action mappings, it supports zero-shot transfer to unseen scenes, so it can be applied to new games without additional training data.
🔥 More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
📅 Published on Sep 30, 2025
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2509.25848
• PDF: https://arxiv.org/pdf/2509.25848
• Project Page: https://xytian1008.github.io/VAPO/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VisionLanguageModels #MultimodalReasoning #VisualForgetting #VisionAnchoredPolicyOptimization #PerceptualGrounding
💡 This paper examines reasoning in Vision-Language Models and identifies a dual nature of multimodal reasoning. While reasoning enhances logical inference and improves performance on complex tasks, it can also impair perceptual grounding, leading to recognition failures on basic visual questions. The authors attribute this to visual forgetting, where prolonged reasoning causes the model to disregard visual input. To address it, they propose Vision-Anchored Policy Optimization (VAPO), a method that steers the reasoning process toward visually grounded trajectories. The resulting model, VAPO-Thinker-7B, relies significantly more on visual information and achieves state-of-the-art results on a range of benchmarks. The key contributions are the identification of this dual nature and a method that balances reasoning with visual grounding, improving performance on visual tasks.
🔥 TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction
📅 Published on May 25
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.26115
• PDF: https://arxiv.org/pdf/2605.26115
• Project Page: https://lhmd.top/trisplat/#interactive
🤖 Models citing this paper:
• https://huggingface.co/lhmd/TriSplat
📊 Datasets citing this paper:
• https://huggingface.co/datasets/lhmd/re10k_torch
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#3DSceneReconstruction #SimulationReadyMeshes #FeedForwardNetworks #TrianglePrimitives #ComputerVision
💡 The paper presents TriSplat, a feed-forward 3D reconstruction network that generates simulation-ready meshes from single images. The problem addressed is that existing methods for 3D reconstruction require expensive post-processing steps to extract a usable mesh for simulation or physics reasoning. Most existing methods use Gaussian primitives and do not directly expose surfaces, making it difficult to obtain a simulation-ready mesh.
The method proposed in the paper uses oriented triangle primitives to represent scenes and directly exports simulation-ready mesh scenes from a single forward pass. The network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics from input images. The approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization.
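The geometric part of this step, deriving normals from a point map and turning a normal into a local frame, can be sketched with NumPy. This is a generic construction under stated assumptions, not the paper's code: it uses forward differences and a fixed helper axis, omits the learned normal head, and the names `normals_from_point_map` and `local_frame` are hypothetical.

```python
import numpy as np

def normals_from_point_map(points):
    """points: (H, W, 3) per-pixel 3D positions. Returns unit normals, (H-1, W-1, 3)."""
    du = points[:-1, 1:] - points[:-1, :-1]  # step to the next pixel along image x
    dv = points[1:, :-1] - points[:-1, :-1]  # step to the next pixel along image y
    n = np.cross(du, dv)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.clip(norm, 1e-12, None)

def local_frame(normal):
    """Build an orthonormal frame with rows (tangent1, tangent2, normal)."""
    # Pick a helper axis that is not nearly parallel to the normal.
    helper = np.array([1.0, 0.0, 0.0]) if abs(normal[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    t1 = np.cross(normal, helper)
    t1 = t1 / np.linalg.norm(t1)
    t2 = np.cross(normal, t1)
    return np.stack([t1, t2, normal])

# Toy point map: a flat plane at z = 0 sampled on a 4x5 pixel grid.
ys, xs = np.mgrid[0:4, 0:5]
points = np.stack([xs, ys, np.zeros_like(xs)], axis=-1).astype(float)
normals = normals_from_point_map(points)
frame = local_frame(normals[0, 0])
```

For the flat plane every normal points along +z, and the frame's two tangent rows span the plane, which is the kind of stable basis a triangle's vertices can be parameterized in.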
The results show that the proposed representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. The output of the network can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction. The experiments were conducted on RealEstate10K and DL3DV datasets and demonstrate the effectiveness of the proposed approach. Overall, the paper contributes a novel method for 3D scene reconstruction that bypasses expensive post-processing steps and directly generates simulation-ready meshes from single images.
🔥 WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
📅 Published on May 25
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.25874
• PDF: https://arxiv.org/pdf/2605.25874
• Project Page: https://meituan-longcat.github.io/WBench/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#WorldModelEvaluation #InteractiveVideoBenchmarking #MultiturnDialogueSystems #VideoQualityAssessment #ArtificialIntelligenceForVideoAnalysis
💡 The paper introduces WBench, a comprehensive benchmark for evaluating interactive world models. The problem addressed is that existing benchmarks for interactive world models are limited and do not provide a unified standard for evaluation. To fill this gap, the authors created WBench, which evaluates models across five dimensions: video quality, setting adherence, interaction adherence, consistency, and physics compliance.
WBench comprises 289 test cases and 1,058 interaction turns, covering diverse scenarios and interaction types, including navigation, subject action, event editing, and perspective switching. The benchmark unifies different input interfaces, such as text, 6-DoF pose, and discrete-action control, so that models with different native input interfaces can be evaluated side by side. The evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments.
The results show that no single model performs strongly across all dimensions. The authors evaluated 20 state-of-the-art models using WBench and found that each model has characteristic strengths, weaknesses, and open challenges. The paper provides detailed diagnostic insights into the performance of each model, highlighting areas for improvement. The code and data for WBench are made available, allowing other researchers to use the benchmark to evaluate and improve their own interactive world models. Overall, the paper contributes to the development of interactive world models by providing a comprehensive and unified benchmark for evaluation.
🔥 ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
📅 Published on May 19
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.20342
• PDF: https://arxiv.org/pdf/2605.20342
• Project Page: https://evolvinglmms-lab.github.io/ParaVT/
🤖 Models citing this paper:
• https://huggingface.co/ParaVT/ParaVT-8B
📊 Datasets citing this paper:
• https://huggingface.co/datasets/ParaVT/ParaVT-Source
• https://huggingface.co/datasets/ParaVT/ParaVT-Parquet
🚀 Spaces citing this paper:
• https://huggingface.co/spaces/ParaVT/ParaVT
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#AgenticVideoReinforcementLearning #ParallelToolUse #MultiAgentReinforcementLearning #VideoToolCalling #ToolPriorParadox
💡 The paper introduces ParaVT, a multi-agent reinforcement learning framework for parallel video tool calling, which enables the use of multiple video-processing tools simultaneously. This approach addresses the limitations of existing sequential methods, where a single incorrect tool call can propagate errors and corrupt context. The authors identify a key challenge in applying standard reinforcement learning to ParaVT, known as the Tool Prior Paradox, where pretrained tool priors enable tool exploration but also destabilize the model's structural format and create a shortcut for skipping tools.
To address this issue, the authors propose PARA-GRPO, a modified reinforcement learning algorithm that incorporates two complementary mechanisms: a targeted format reward and a per-prompt frame-budget randomization. The targeted format reward helps to stabilize the model's structural format, while the frame-budget randomization encourages the model to use tools in a way that yields a measurable reward signal.
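How these pieces could fit into a GRPO-style update can be sketched as follows. The group-relative advantage (reward minus group mean, divided by group standard deviation) is the standard GRPO formulation. Everything else is an assumption for illustration: the `<tool_call>` and `<answer>` tags, the candidate frame budgets, and the function names are hypothetical, and the summary does not specify PARA-GRPO's actual reward terms.

```python
import random
import re
import statistics

# Hypothetical response format: one or more tool calls, then a final answer.
TOOL_CALL = re.compile(r"<tool_call>.*?</tool_call>", re.S)
FINAL_ANSWER = re.compile(r"<answer>.*?</answer>\s*$", re.S)

def format_reward(response):
    """1.0 only if the response calls a tool and ends with a final answer."""
    has_call = TOOL_CALL.search(response) is not None
    has_answer = FINAL_ANSWER.search(response) is not None
    return 1.0 if has_call and has_answer else 0.0

def sample_frame_budget(rng, choices=(8, 16, 32, 64)):
    """Draw a fresh frame budget for each prompt."""
    return rng.choice(choices)

def group_advantages(rewards):
    """Group-relative advantages: normalize rewards within one prompt's rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + 1e-6) for r in rewards]

rollouts = [
    "<tool_call>crop(0, 10)</tool_call><answer>B</answer>",
    "<answer>B</answer>",  # skips the tool entirely
    "<tool_call>crop(5, 9)</tool_call><answer>A</answer>",
    "I think the answer is B",
]
rewards = [format_reward(r) for r in rollouts]
advantages = group_advantages(rewards)
budget = sample_frame_budget(random.Random(0))
```

Rollouts that skip the tool or break the format receive a negative advantage relative to the rest of their group, which is the pressure a format-oriented reward is meant to apply.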
The authors evaluate ParaVT with PARA-GRPO on six long-video understanding benchmarks and achieve an average improvement of 7.9% over the baseline Qwen3-VL model. Additionally, PARA-GRPO lifts training-time format compliance from 0.13 to 0.64, demonstrating the effectiveness of the proposed approach. The paper's contributions include a new framework for parallel video tool calling, a modified reinforcement learning algorithm, and a set of experimental results that demonstrate the benefits of the proposed approach. Overall, the paper provides a general recipe for agentic reinforcement learning that can be applied to a wide range of applications where tool capabilities are internalized in large multimodal models.
🔥 MiniCPM4: Ultra-Efficient LLMs on End Devices
📅 Published on Jun 9, 2025
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2506.07900
• PDF: https://arxiv.org/pdf/2506.07900
• Project Page: https://huggingface.co/collections/openbmb/minicpm4-6841ab29d180257e940baa9b
🤖 Models citing this paper:
• https://huggingface.co/openbmb/MiniCPM4.1-8B
• https://huggingface.co/openbmb/MiniCPM5-1B
• https://huggingface.co/openbmb/MiniCPM4-8B
📊 Datasets citing this paper:
• https://huggingface.co/datasets/openbmb/Ultra-FineWeb
🚀 Spaces citing this paper:
• https://huggingface.co/spaces/openbmb/MiniCPM5-1B-Demo
• https://huggingface.co/spaces/openbmb/Ultra-FineWeb-L2-Selector
• https://huggingface.co/spaces/openbmb/MiniCPM4.1-8B-Demo
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#EfficientLLMs #LargeLanguageModels #SparseAttentionMechanisms #EndDeviceComputing #LowResourceNLP
💡 The paper introduces MiniCPM4, a highly efficient large language model designed for end-side devices. The goal is to achieve superior performance while being efficient, which is a challenge for large language models due to their computational requirements. To address this, the authors propose innovations in four key areas: model architecture, training data, training algorithms, and inference systems.
In terms of model architecture, the authors propose InfLLM v2, a trainable sparse attention mechanism that accelerates both prefilling and decoding phases for long-context processing. For training data, they propose UltraClean, an efficient and accurate pre-training data filtering and generation strategy, and UltraChat v2, a comprehensive supervised fine-tuning dataset. These datasets enable satisfactory model performance to be achieved using just 8 trillion training tokens.
The authors also propose ModelTunnel v2 for efficient pre-training strategy search, and improve existing post-training methods by introducing chunk-wise rollout for load-balanced reinforcement learning and a data-efficient ternary LLM, BitCPM. For inference systems, they propose CPM.cu, which integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding.
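To give a sense of what a ternary model stores, the sketch below quantizes a weight matrix to the values -1, 0, and +1 with a single scale. It uses the widely known absmean scheme as a stand-in. The summary does not describe BitCPM's actual quantizer or training recipe, so this should be read as an illustrative assumption, with `ternarize` and `dequantize` as hypothetical names.

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Quantize weights to {-1, 0, +1} using one absmean scale per tensor."""
    scale = np.mean(np.abs(w)) + eps
    q = np.clip(np.round(w / scale), -1, 1).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Approximate the original weights from the ternary codes and the scale."""
    return q.astype(np.float32) * scale

w = np.array([[0.9, -0.05, -1.2],
              [0.1, 0.6, -0.7]], dtype=np.float32)
q, scale = ternarize(w)
w_hat = dequantize(q, scale)
```

Each weight then needs under two bits of storage plus one shared scale, which is where the memory savings of ternary models come from.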
The MiniCPM4 model is available in two versions, with 0.5B and 8B parameters, respectively. The evaluation results show that MiniCPM4 outperforms open-source models of similar size across multiple benchmarks, highlighting both its efficiency and effectiveness. Notably, MiniCPM4-8B demonstrates significant speed improvements over Qwen3-8B when processing long sequences.
The results also show that MiniCPM4 can be adapted to power diverse applications, including trustworthy survey generation and tool use with model context protocol, clearly showcasing its broad usability. Overall, the paper presents a highly efficient large language model that achieves superior performance on end-side devices, making it a significant contribution to the field of natural language processing.
🔥 Toward Native Multimodal Modeling: A Roadmap
📅 Published on May 25
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.25343
• PDF: https://arxiv.org/pdf/2605.25343
• Project Page: https://nmm-roadmap.github.io/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#NativeMultimodalModeling #MultimodalTransformerArchitectures #EarlyFusionTechniques #MidFusionApproaches #UnifiedTransformerFrameworks
💡 The paper presents a roadmap for native multimodal modeling, which integrates different modalities within a unified transformer framework, enabling seamless understanding and generation across diverse input-output configurations. Traditional approaches rely on late-fusion, where encoders and language backbones are assembled with output heads, but recent efforts have shifted towards native multimodal modeling for superior performance. However, the design space of native architectures remains poorly defined.
To address this, the authors formally define architectural nativity, distinguishing mid-fusion and early-fusion from non-native paradigms. They categorize existing native models into three categories: Multi-to-Text for cross-modal comprehension with text-only output, Multi-to-Target for scenario-oriented generation such as image, audio, and video generation, and Multi-to-Multi for unified modeling with symmetric input-output.
The authors provide a comprehensive investigation into the transition towards a definitive native multimodal modeling framework, where understanding and generation coexist within a unified transformer paradigm. They systematically examine the end-to-end pipeline, including architectural coordination, massive data curation, full-stack training recipes, inference and deployment, and comprehensive evaluation for truly native modeling.
The paper's contributions include a formalized roadmap for native multimodal modeling, a categorization of existing native models, and a comprehensive investigation into the transition towards a unified transformer framework. The results provide a foundation for the development of native multimodal models that can seamlessly understand and generate across diverse input-output configurations, representing a significant step towards world modeling and modality-agnostic reasoning.