AI & ML Papers

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

638 views15:51

483 views01:52

🔥 OpenGuardrails: An Open-Source Context-Aware AI Guardrails Platform

💡 OpenGuardrails is an open source project that provides a unified model for detecting content safety and model manipulation risks in large language models. The project aims to address the critical issue of safeguarding large language models against unsafe, malicious, or privacy violating content. The OpenGuardrails platform offers a comprehensive solution that includes a context aware safety and manipulation detection model, as well as a separate named entity recognition pipeline for identifying and redacting sensitive data.

The platform protects against various types of risks, including content safety risks, model manipulation attacks such as prompt injection and jailbreaking, and data leakage. The content safety and model manipulation detection are implemented using a unified large model, while data leakage identification and redaction are performed using a separate lightweight named entity recognition pipeline.

The OpenGuardrails system can be deployed in various ways, including as a security gateway or an API based service, with enterprise grade deployment options that ensure fully private deployment. The project achieves state of the art performance on safety benchmarks, excelling in both prompt and response classification across multiple languages, including English, Chinese, and multilingual tasks.

The key contributions of the OpenGuardrails project include providing a unified model for content safety and model manipulation detection, offering a separate named entity recognition pipeline for data leakage identification and redaction, and achieving state of the art performance on safety benchmarks. The project also makes all models available under the Apache 2.0 license for public use, allowing for widespread adoption and further development of the technology. Overall, OpenGuardrails provides a comprehensive and effective solution for safeguarding large language models against various types of risks, and its open source nature makes it a valuable resource for the data science community.

📅 Published on Oct 22, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2510.19169
• PDF: https://arxiv.org/pdf/2510.19169
• Project Page: https://openguardrails.com

🤖 Models citing this paper:
• https://huggingface.co/openguardrails/OpenGuardrails-Text-2510
• https://huggingface.co/openguardrails/OpenGuardrails-Text-4B-0124

📊 Datasets citing this paper:
• https://huggingface.co/datasets/openguardrails/OpenGuardrailsMixZh_97k
• https://huggingface.co/datasets/qtqtqtqt/OpenGuardrailsMixZh_97k
• https://huggingface.co/datasets/ruishen123/OpenGuardrailsMixZh_97k

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ContextAwareAI #LargeLanguageModels #ContentSafety #ModelManipulation #NamedEntityRecognition

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

601 views01:52

532 views01:52

🔥 GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

💡 The paper introduces GEPA, a prompt optimizer that uses natural language reflection to learn high level rules from trial and error, outperforming reinforcement learning methods. The problem addressed is that current reinforcement learning methods, such as Group Relative Policy Optimization, require thousands of rollouts to learn new tasks, which can be time consuming and inefficient. The authors argue that the interpretable nature of language can provide a richer learning medium for large language models compared to policy gradients derived from sparse scalar rewards.

The method used is GEPA, a Genetic-Pareto prompt optimizer that incorporates natural language reflection to learn high level rules from trial and error. GEPA samples system level trajectories, reflects on them in natural language to diagnose problems, proposes and tests prompt updates, and combines complementary lessons from its own attempts. This approach allows GEPA to turn even a few rollouts into a large quality gain.

The results show that GEPA outperforms Group Relative Policy Optimization by 10 percent on average and by up to 20 percent, while using up to 35 times fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10 percent across two large language models. Additionally, GEPA demonstrates promising results as an inference time search strategy for code optimization. Overall, the paper contributes a new approach to prompt optimization that can efficiently learn high level rules from trial and error, outperforming current reinforcement learning methods.

📅 Published on Jul 25, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2507.19457
• PDF: https://arxiv.org/pdf/2507.19457
• Project Page: https://gepa-ai.github.io/gepa/

🤖 Models citing this paper:
• https://huggingface.co/pirola/local-ai-coding-stack-research

📊 Datasets citing this paper:
• https://huggingface.co/datasets/zhongweixie/A-Survey-on-AI-Agent-Harness

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#NaturalLanguageReflection #PromptOptimization #ReinforcementLearningAlternatives #GeneticParetoOptimization #LanguageModelLearning

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

❤3

673 views01:52

466 views21:52

🔥 SkillOpt: Executive Strategy for Self-Evolving Agent Skills

💡 The paper introduces SkillOpt, a systematic approach to optimize agent skills through a text-space optimizer. Currently, agent skills are either hand-crafted, generated in one shot, or evolved through self-revision, which often results in unreliable improvements. SkillOpt addresses this issue by training skills as external state of a frozen agent, similar to how deep learning optimizers work.

The method involves a separate optimizer model that takes scored rollouts and applies bounded edits to a single skill document, accepting edits only when they improve a held-out validation score. To ensure stability, SkillOpt uses a textual learning-rate budget, rejected-edit buffer, and epoch-wise slow updates, all of which add zero inference-time model calls at deployment.

The results show that SkillOpt outperforms existing methods across six benchmarks, seven target models, and three execution environments. It achieves the best or tied performance on all 52 evaluated cells and beats every competitor, including human, one-shot LLM, and other skill optimization methods. Notably, SkillOpt improves the average no-skill accuracy by 23.5 points on GPT-5.5 in direct chat, 24.8 points inside the Codex agentic loop, and 19.1 points inside Claude Code.

Furthermore, transfer experiments demonstrate that optimized skill artifacts retain their value when moved across model scales, between different execution environments, and to nearby benchmarks without further optimization. Overall, SkillOpt provides a systematic and controllable approach to optimize agent skills, resulting in superior performance and reliable improvements.

📅 Published on May 22

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23904
• PDF: https://arxiv.org/pdf/2605.23904
• Project Page: https://microsoft.github.io/SkillOpt/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#SelfEvolvingAgents #AgentSkillOptimization #TextSpaceOptimization #DeepLearningForAgents #ArtificialIntelligenceOptimization

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

451 views21:52

319 views21:52

🔥 Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

💡 The paper introduces Lens, a compact 3.8 billion parameter text-to-image model that achieves superior performance with reduced training compute. The problem addressed is the high computational cost of training large text-to-image models, which can be a significant barrier to their adoption. To address this, the authors propose two key strategies. First, they maximize data information density per training batch by using a dataset of 800 million densely captioned image-text pairs, where each caption contains approximately 109 words on average, providing richer semantic supervision than conventional short captions. They also construct each batch from images with multiple resolutions and diverse aspect ratios, enlarging the effective visual coverage of each optimization step.

Second, they improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. The authors also apply reinforcement learning with taxonomy-driven prompts and structured reward rubrics to suppress artifacts and improve visual quality, and use a reasoner module with training-free system prompt search to better align user requests with the model.

The results show that Lens achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6 billion parameters, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The model generalizes to arbitrary aspect ratios and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds. Overall, the paper demonstrates that Lens is a highly efficient and effective text-to-image model that can be trained with significantly less computational resources than existing models.

📅 Published on May 20

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.21573
• PDF: https://arxiv.org/pdf/2605.21573
• Project Page: https://huggingface.co/microsoft/Lens

🤖 Models citing this paper:
• https://huggingface.co/microsoft/Lens-Turbo
• https://huggingface.co/microsoft/Lens
• https://huggingface.co/microsoft/Lens-Base

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/multimodalart/lens

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#TextToImageModels #EfficientTrainingMethods #CompactNeuralNetworks #ImageTextPairs #FoundationalModeling

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

451 views21:52

410 views21:53

🔥 PhotoFlow: Agentic 3D Virtual Photography Missions

💡 The paper introduces PhotoFlow, a Director-Reviewer-Reflector agent that enables language-conditioned virtual photography in arbitrary 3D scenes. The problem addressed is to create an agent that can enter a 3D scene, infer a suitable shot based on scene information and language intent, and render a photograph without preselected camera pose or reference image. This task requires complex 3D spatial understanding and abstract aesthetic judgment, which are difficult to evaluate together.

The method proposed is a closed-loop camera search using the Director-Reviewer-Reflector agent. The Director builds a photographic blueprint and proposes candidate cameras, the Reviewer checks and critiques the proposals, and the Reflector converts failures into region memory and adjusts the search. The authors also introduce VPhotoBench, a benchmark of 47 open-license 3D scenes and 141 language-conditioned photography missions.

The results show that PhotoFlow achieves the strongest external quality-alignment composite and success rate among various methods, including one-shot prediction, single-chain reflection, anchor-bank selection, and random search, under a six-round rendering budget. The paper demonstrates that a language model-centered spatial agent can produce strong photographs in a setting that challenges both 3D reasoning and aesthetic choice, making language-conditioned virtual photography in arbitrary 3D scenes an executable agent task.

📅 Published on May 22

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23771
• PDF: https://arxiv.org/pdf/2605.23771
• Project Page: https://visionary-laboratory.github.io/PhotoFlow/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VirtualPhotography #3DSceneUnderstanding #AgenticSystems #LanguageConditionedRendering #IntelligentCameraSystems

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

500 views21:53

🔥 SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

💡 The paper introduces SCOPE, a method for simulating cross game operations in playable environments for first person shooter games. The problem addressed is that existing methods for interactive world models in FPS games struggle to handle high frequency overlapping control signals without disrupting unaffected regions. This is because they inject actions globally and are trained on single game titles, which fails under dense FPS inputs.

The proposed method conditions transformer blocks in video diffusion models to separate in scope from out of scope visual effects without requiring segmentation labels. This is achieved by inserting a conditioning module into each transformer block of a pre trained video diffusion model, which reshapes features into per pixel temporal sequences. This allows each position to compute its action response from local visual content, effectively separating in scope effects from out of scope generation.

The authors also introduce CrossFPS, a multi game FPS dataset with frame aligned action telemetry, comprising 69K clips from 7 titles with 10 degree of freedom controller signals. This dataset is curated to remove gameplay bias, allowing the model to learn general visual to action mappings rather than game specific patterns.

The results show that the SCOPE method enables strong action responsiveness, precise scope separation, and effective cross game generalization. The model is able to learn general visual to action mappings, which enables zero shot transfer to unseen scenes. This means that the model can be applied to new games without requiring additional training data, making it a significant contribution to the field of interactive world models for FPS games.

📅 Published on May 22

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23345
• PDF: https://arxiv.org/pdf/2605.23345
• Project Page: https://z2tong.github.io/SCOPE/

🤖 Models citing this paper:
• https://huggingface.co/zizhaotong/SCOPE

📊 Datasets citing this paper:
• https://huggingface.co/datasets/zizhaotong/CrossFPS-train
• https://huggingface.co/datasets/zizhaotong/CrossFPS-val

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#FirstPersonShooterGames #CrossGameOperations #PlayableEnvironments #VideoDiffusionModels #TransformerBlocks

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

607 views21:53

This media is not supported in your browser

1:06

617 views21:53

512 views07:53

🔥 More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models

💡 This paper explores the concept of reasoning in Vision Language Models and identifies a dual nature of multimodal reasoning. While reasoning enhances logical inference and improves performance on complex tasks, it can also impair perceptual grounding, leading to recognition failures on basic visual questions. The authors attribute this phenomenon to visual forgetting, where prolonged reasoning causes the model to disregard visual input. To address this issue, the authors propose Vision Anchored Policy Optimization, a method that steers the reasoning process toward visually grounded trajectories. The resulting model, VAPO Thinker 7B, significantly strengthens the model's reliance on visual information and achieves state of the art results on a range of benchmarks. The key contribution of this paper is the identification of the dual nature of multimodal reasoning and the development of a method to balance reasoning and visual grounding, leading to improved performance on visual tasks.

📅 Published on Sep 30, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2509.25848
• PDF: https://arxiv.org/pdf/2509.25848
• Project Page: https://xytian1008.github.io/VAPO/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #MultimodalReasoning #VisualForgetting #VisionAnchoredPolicyOptimization #PerceptualGrounding

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

648 views07:53

🔥 TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

💡 The paper presents TriSplat, a feed-forward 3D reconstruction network that generates simulation-ready meshes from single images. The problem addressed is that existing methods for 3D reconstruction require expensive post-processing steps to extract a usable mesh for simulation or physics reasoning. Most existing methods use Gaussian primitives and do not directly expose surfaces, making it difficult to obtain a simulation-ready mesh.

The method proposed in the paper uses oriented triangle primitives to represent scenes and directly exports simulation-ready mesh scenes from a single forward pass. The network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics from input images. The approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization.

The results show that the proposed representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. The output of the network can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction. The experiments were conducted on RealEstate10K and DL3DV datasets and demonstrate the effectiveness of the proposed approach. Overall, the paper contributes a novel method for 3D scene reconstruction that bypasses expensive post-processing steps and directly generates simulation-ready meshes from single images.

📅 Published on May 25

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.26115
• PDF: https://arxiv.org/pdf/2605.26115
• Project Page: https://lhmd.top/trisplat/#interactive

🤖 Models citing this paper:
• https://huggingface.co/lhmd/TriSplat

📊 Datasets citing this paper:
• https://huggingface.co/datasets/lhmd/re10k_torch

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#3DSceneReconstruction #SimulationReadyMeshes #FeedForwardNetworks #TrianglePrimitives #ComputerVision

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

552 views17:49

452 views17:50

❤1

366 viewsedited 19:50

🔥 WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation

💡 The paper introduces WBench, a comprehensive benchmark for evaluating interactive world models. The problem addressed is that existing benchmarks for interactive world models are limited and do not provide a unified standard for evaluation. To fill this gap, the authors created WBench, which evaluates models across five dimensions: video quality, setting adherence, interaction adherence, consistency, and physics compliance.

The method used to create WBench involves 289 test cases and 1058 interaction turns, covering diverse scenarios and interaction types, including navigation, subject action, event editing, and perspective switching. The benchmark unifies different input interfaces, such as text, 6-DoF pose, and discrete-action control, allowing for the evaluation of models with different native input interfaces. The evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments.

The results show that no single model performs strongly across all dimensions. The authors evaluated 20 state-of-the-art models using WBench and found that each model has characteristic strengths, weaknesses, and open challenges. The paper provides detailed diagnostic insights into the performance of each model, highlighting areas for improvement. The code and data for WBench are made available, allowing other researchers to use the benchmark to evaluate and improve their own interactive world models. Overall, the paper contributes to the development of interactive world models by providing a comprehensive and unified benchmark for evaluation.

📅 Published on May 25

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.25874
• PDF: https://arxiv.org/pdf/2605.25874
• Project Page: https://meituan-longcat.github.io/WBench/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#WorldModelEvaluation #InteractiveVideoBenchmarking #MultiturnDialogueSystems #VideoQualityAssessment #ArtificialIntelligenceForVideoAnalysis

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

428 views19:50

279 views21:50

🔥 ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning

💡 The paper introduces ParaVT, a multi-agent reinforcement learning framework for parallel video tool calling, which enables the use of multiple video-processing tools simultaneously. This approach addresses the limitations of existing sequential methods, where a single incorrect tool call can propagate errors and corrupt context. The authors identify a key challenge in applying standard reinforcement learning to ParaVT, known as the Tool Prior Paradox, where pretrained tool priors enable tool exploration but also destabilize the model's structural format and create a shortcut for skipping tools.

To address this issue, the authors propose PARA-GRPO, a modified reinforcement learning algorithm that incorporates two complementary mechanisms: a targeted format reward and a per-prompt frame-budget randomization. The targeted format reward helps to stabilize the model's structural format, while the frame-budget randomization encourages the model to use tools in a way that yields a measurable reward signal.

The authors evaluate ParaVT with PARA-GRPO on six long-video understanding benchmarks and achieve an average improvement of 7.9% over the baseline Qwen3-VL model. Additionally, PARA-GRPO lifts training-time format compliance from 0.13 to 0.64, demonstrating the effectiveness of the proposed approach. The paper's contributions include a new framework for parallel video tool calling, a modified reinforcement learning algorithm, and a set of experimental results that demonstrate the benefits of the proposed approach. Overall, the paper provides a general recipe for agentic reinforcement learning that can be applied to a wide range of applications where tool capabilities are internalized in large multimodal models.

📅 Published on May 19

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.20342
• PDF: https://arxiv.org/pdf/2605.20342
• Project Page: https://evolvinglmms-lab.github.io/ParaVT/

🤖 Models citing this paper:
• https://huggingface.co/ParaVT/ParaVT-8B

📊 Datasets citing this paper:
• https://huggingface.co/datasets/ParaVT/ParaVT-Source
• https://huggingface.co/datasets/ParaVT/ParaVT-Parquet

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/ParaVT/ParaVT

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticVideoReinforcementLearning #ParallelToolUse #MultiAgentReinforcementLearning #VideoToolCalling #ToolPriorParadox

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

313 views21:50