AI & ML Papers
🔥 Cybersecurity AI: Humanoid Robots as Attack Vectors
📅 Published on Sep 17, 2025
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2509.14139
• PDF: https://arxiv.org/pdf/2509.14139
• Project Page: https://aliasrobotics.com
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#CybersecurityVulnerabilities #HumanoidRobotExploits #BLEProtocolVulnerabilities #RoboticsSecurityRisks #ArtificialIntelligenceThreats
💡 The paper presents a security assessment of the Unitree G1 humanoid robot and finds a critical command injection vulnerability in its BLE provisioning protocol. Malformed Wi-Fi credentials yield root access, an exploit made possible by hardcoded AES keys shared across all units. The researchers also partially reverse engineered the robot's proprietary FMX encryption, revealing a static Blowfish-ECB layer and a predictable LCG mask.
The study reveals two significant risks. First, the robot can function as a trojan horse, continuously exfiltrating sensor and service-state telemetry to specific IP addresses without the operator's notice, in violation of GDPR. Second, a resident Cybersecurity AI agent can pivot from reconnaissance to offensive preparation against any target, such as the manufacturer's cloud control plane, demonstrating the potential for escalation from passive monitoring to active counter-operations.
The researchers argue that these findings highlight the need for improved security standards in commercial robotics, particularly as humanoids move into critical infrastructure. The study contributes empirical evidence to shape future security standards for physical-cyber convergence systems, suggesting the need for adaptive Cybersecurity AI-powered defenses to mitigate these risks. The paper's contributions include the identification of critical vulnerabilities in the Unitree G1 humanoid robot, the demonstration of its potential as a covert surveillance node and active cyber operations platform, and the emphasis on the need for enhanced security measures to protect against such threats.
🔥 OpenGuardrails: An Open-Source Context-Aware AI Guardrails Platform
📅 Published on Oct 22, 2025
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2510.19169
• PDF: https://arxiv.org/pdf/2510.19169
• Project Page: https://openguardrails.com
🤖 Models citing this paper:
• https://huggingface.co/openguardrails/OpenGuardrails-Text-2510
• https://huggingface.co/openguardrails/OpenGuardrails-Text-4B-0124
📊 Datasets citing this paper:
• https://huggingface.co/datasets/openguardrails/OpenGuardrailsMixZh_97k
• https://huggingface.co/datasets/qtqtqtqt/OpenGuardrailsMixZh_97k
• https://huggingface.co/datasets/ruishen123/OpenGuardrailsMixZh_97k
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#ContextAwareAI #LargeLanguageModels #ContentSafety #ModelManipulation #NamedEntityRecognition
💡 OpenGuardrails is an open-source platform for safeguarding large language models against unsafe, malicious, or privacy-violating content. It combines a unified, context-aware model for detecting content safety and model manipulation risks with a separate named entity recognition pipeline for identifying and redacting sensitive data.
The platform protects against several types of risk, including content safety violations, model manipulation attacks such as prompt injection and jailbreaking, and data leakage. Content safety and model manipulation detection are handled by a unified large model, while data leakage identification and redaction are performed by a separate lightweight named entity recognition pipeline.
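The redaction step described above can be sketched as follows. This is a minimal illustration, not the OpenGuardrails implementation: the regular expressions stand in for a trained named entity recognition model, and the entity labels and `<LABEL>` placeholder format are assumptions made for the example.

```python
import re

# Hypothetical entity patterns; a real pipeline would use a trained NER model.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "PHONE": re.compile(r"\+?\d[\d\s-]{7,}\d"),
}


def redact(text: str) -> tuple[str, list[tuple[str, str]]]:
    """Replace each detected entity with a typed placeholder and report what was found."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((label, match.group()))
        # Substitute after collecting matches so the report keeps the original values.
        text = pattern.sub(f"<{label}>", text)
    return text, found


redacted, entities = redact("Contact alice@example.com or +1 555-123-4567.")
print(redacted)
print(entities)
```

Keeping redaction in a separate lightweight step, as the summary describes, means it can run on every request without invoking the larger detection model.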
The OpenGuardrails system can be deployed as a security gateway or as an API-based service, with enterprise-grade options for fully private deployment. The project reports state-of-the-art performance on safety benchmarks, in both prompt and response classification, across English, Chinese, and multilingual tasks.
The key contributions are the unified model for content safety and model manipulation detection, the separate named entity recognition pipeline for data leakage identification and redaction, and state-of-the-art results on safety benchmarks. All models are released under the Apache 2.0 license for public use, which allows widespread adoption and further development of the technology.
🔥 GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning
📅 Published on Jul 25, 2025
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2507.19457
• PDF: https://arxiv.org/pdf/2507.19457
• Project Page: https://gepa-ai.github.io/gepa/
🤖 Models citing this paper:
• https://huggingface.co/pirola/local-ai-coding-stack-research
📊 Datasets citing this paper:
• https://huggingface.co/datasets/zhongweixie/A-Survey-on-AI-Agent-Harness
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#NaturalLanguageReflection #PromptOptimization #ReinforcementLearningAlternatives #GeneticParetoOptimization #LanguageModelLearning
💡 The paper introduces GEPA, a prompt optimizer that uses natural language reflection to learn high-level rules from trial and error, and reports that it outperforms reinforcement learning methods. The problem addressed is that current reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), often require thousands of rollouts to learn new tasks, which is slow and inefficient. The authors argue that the interpretable nature of language can provide a richer learning medium for large language models than policy gradients derived from sparse scalar rewards.
GEPA (Genetic-Pareto) samples system-level trajectories, reflects on them in natural language to diagnose problems, proposes and tests prompt updates, and combines complementary lessons from its own attempts. This approach allows GEPA to turn even a few rollouts into a large quality gain.
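The "Pareto" part of the name suggests keeping candidate prompts that are not beaten everywhere by another candidate. The sketch below is a generic Pareto-front filter over per-task scores, not the paper's exact selection rule; the prompt names and score values are made up for illustration.

```python
def dominates(a: list[float], b: list[float]) -> bool:
    """True if a is at least as good as b on every task and strictly better on one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_front(scores: dict[str, list[float]]) -> list[str]:
    """Return the names of candidates that no other candidate dominates."""
    return [
        name
        for name, own in scores.items()
        if not any(dominates(other, own) for o, other in scores.items() if o != name)
    ]


# Hypothetical per-task scores for three candidate prompts.
scores = {
    "prompt_a": [0.9, 0.2, 0.5],
    "prompt_b": [0.4, 0.8, 0.5],
    "prompt_c": [0.3, 0.1, 0.4],
}
front = pareto_front(scores)
print(front)
```

Here `prompt_a` and `prompt_b` each win on a different task, so both survive and their lessons could be merged, while `prompt_c` is dominated and dropped.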
The results show that GEPA outperforms GRPO by 10 percent on average and by up to 20 percent, while using up to 35 times fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10 percent across two large language models. Additionally, GEPA shows promising results as an inference-time search strategy for code optimization. Overall, the paper contributes a sample-efficient approach to prompt optimization that learns high-level rules from trial and error.
🔥 SkillOpt: Executive Strategy for Self-Evolving Agent Skills
📅 Published on May 22
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23904
• PDF: https://arxiv.org/pdf/2605.23904
• Project Page: https://microsoft.github.io/SkillOpt/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#SelfEvolvingAgents #AgentSkillOptimization #TextSpaceOptimization #DeepLearningForAgents #ArtificialIntelligenceOptimization
💡 The paper introduces SkillOpt, a systematic approach to optimize agent skills through a text-space optimizer. Currently, agent skills are either hand-crafted, generated in one shot, or evolved through self-revision, which often results in unreliable improvements. SkillOpt addresses this issue by training skills as external state of a frozen agent, similar to how deep learning optimizers work.
The method involves a separate optimizer model that takes scored rollouts and applies bounded edits to a single skill document, accepting edits only when they improve a held-out validation score. To ensure stability, SkillOpt uses a textual learning-rate budget, rejected-edit buffer, and epoch-wise slow updates, all of which add zero inference-time model calls at deployment.
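The accept-only-on-improvement loop described above can be sketched as follows. This is a toy under stated assumptions, not SkillOpt itself: `propose_edit` and `validate` are hypothetical stand-ins for the optimizer model and the held-out scorer, and the edit budget is approximated crudely by the change in document length.

```python
def optimize_skill(skill, propose_edit, validate, max_edit_chars=40, steps=10):
    """Apply bounded edits to one skill document, keeping only validated improvements."""
    best_score = validate(skill)
    rejected = []  # buffer of edits that were refused, fed back to the proposer
    for _ in range(steps):
        candidate = propose_edit(skill, rejected)
        if candidate is None:
            break
        # Crude "learning-rate budget": refuse edits that change too much at once.
        if abs(len(candidate) - len(skill)) > max_edit_chars:
            rejected.append(candidate)
            continue
        score = validate(candidate)
        if score > best_score:
            skill, best_score = candidate, score
        else:
            rejected.append(candidate)
    return skill, best_score, rejected


# Hypothetical proposer: appends one hint at a time, skipping known rejections.
HINTS = ["Check units.", "Be verbose. " * 10, "Cite sources.", "Use emojis."]
REQUIRED = ["units", "sources"]


def propose_edit(skill, rejected):
    for hint in HINTS:
        candidate = skill + "\n" + hint
        if hint not in skill and candidate not in rejected:
            return candidate
    return None


def validate(skill):
    """Toy held-out score: how many required topics the skill mentions."""
    return sum(word in skill.lower() for word in REQUIRED)


final_skill, final_score, rejected = optimize_skill("Answer carefully.", propose_edit, validate)
print(final_skill)
print(final_score, len(rejected))
```

All of the optimization cost sits in this offline loop; the deployed agent only reads the final document, which is consistent with the summary's point about zero extra inference-time model calls.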
The results show that SkillOpt outperforms existing methods across six benchmarks, seven target models, and three execution environments. It achieves the best or tied performance on all 52 evaluated cells and beats every competitor, including human-written skills, one-shot LLM-generated skills, and other skill optimization methods. Notably, SkillOpt improves average accuracy over the no-skill baseline by 23.5 points on GPT-5.5 in direct chat, 24.8 points inside the Codex agentic loop, and 19.1 points inside Claude Code.
Furthermore, transfer experiments demonstrate that optimized skill artifacts retain their value when moved across model scales, between different execution environments, and to nearby benchmarks without further optimization. Overall, SkillOpt provides a systematic and controllable approach to optimize agent skills, resulting in superior performance and reliable improvements.
🔥 Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models
📅 Published on May 20
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.21573
• PDF: https://arxiv.org/pdf/2605.21573
• Project Page: https://huggingface.co/microsoft/Lens
🤖 Models citing this paper:
• https://huggingface.co/microsoft/Lens-Turbo
• https://huggingface.co/microsoft/Lens
• https://huggingface.co/microsoft/Lens-Base
🚀 Spaces citing this paper:
• https://huggingface.co/spaces/multimodalart/lens
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#TextToImageModels #EfficientTrainingMethods #CompactNeuralNetworks #ImageTextPairs #FoundationalModeling
💡 The paper introduces Lens, a compact 3.8 billion parameter text-to-image model that achieves superior performance with reduced training compute. The problem addressed is the high computational cost of training large text-to-image models, which can be a significant barrier to their adoption. To address this, the authors propose two key strategies. First, they maximize data information density per training batch by using a dataset of 800 million densely captioned image-text pairs, where each caption contains approximately 109 words on average, providing richer semantic supervision than conventional short captions. They also construct each batch from images with multiple resolutions and diverse aspect ratios, enlarging the effective visual coverage of each optimization step.
Second, they improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. The authors also apply reinforcement learning with taxonomy-driven prompts and structured reward rubrics to suppress artifacts and improve visual quality, and use a reasoner module with training-free system prompt search to better align user requests with the model.
The results show that Lens achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6 billion parameters, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The model generalizes to arbitrary aspect ratios and resolutions up to 1440^2, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024^2 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds. Overall, the paper demonstrates that an efficient and effective text-to-image model can be trained with significantly fewer computational resources than existing models.
🔥 PhotoFlow: Agentic 3D Virtual Photography Missions
📅 Published on May 22
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23771
• PDF: https://arxiv.org/pdf/2605.23771
• Project Page: https://visionary-laboratory.github.io/PhotoFlow/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VirtualPhotography #3DSceneUnderstanding #AgenticSystems #LanguageConditionedRendering #IntelligentCameraSystems
💡 The paper introduces PhotoFlow, a Director-Reviewer-Reflector agent that enables language-conditioned virtual photography in arbitrary 3D scenes. The problem addressed is to create an agent that can enter a 3D scene, infer a suitable shot based on scene information and language intent, and render a photograph without preselected camera pose or reference image. This task requires complex 3D spatial understanding and abstract aesthetic judgment, which are difficult to evaluate together.
The method proposed is a closed-loop camera search using the Director-Reviewer-Reflector agent. The Director builds a photographic blueprint and proposes candidate cameras, the Reviewer checks and critiques the proposals, and the Reflector converts failures into region memory and adjusts the search. The authors also introduce VPhotoBench, a benchmark of 47 open-license 3D scenes and 141 language-conditioned photography missions.
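The closed loop described above can be sketched with a toy one-dimensional camera search. Everything here is a hypothetical stand-in: the camera is reduced to a single yaw angle, the Reviewer checks distance to a hidden target instead of critiquing a render, and region memory is a list of excluded intervals.

```python
def director(memory, candidates):
    """Propose the first candidate yaw that is outside every failed region."""
    for cam in candidates:
        if not any(lo <= cam <= hi for lo, hi in memory):
            return cam
    return None


def reviewer(cam, target=70, tolerance=5):
    """Toy critique: accept when the yaw is close to a hidden target."""
    return abs(cam - target) <= tolerance


def reflector(memory, cam, radius=10):
    """Turn a failure into region memory so nearby poses are skipped next time."""
    memory.append((cam - radius, cam + radius))


def search(candidates, max_rounds=6):
    """Run the propose, review, reflect loop under a fixed rendering budget."""
    memory = []
    for round_idx in range(1, max_rounds + 1):
        cam = director(memory, candidates)
        if cam is None:
            break
        if reviewer(cam):
            return cam, round_idx
        reflector(memory, cam)
    return None, max_rounds  # budget exhausted or nothing left to propose


result = search(candidates=range(0, 181, 5))
print(result)
```

With candidates every 5 degrees, each failure removes its neighbors from consideration, so the search reaches an acceptable pose within the six-round budget instead of stepping through every candidate.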
The results show that PhotoFlow achieves the strongest external quality-alignment composite and success rate among various methods, including one-shot prediction, single-chain reflection, anchor-bank selection, and random search, under a six-round rendering budget. The paper demonstrates that a language model-centered spatial agent can produce strong photographs in a setting that challenges both 3D reasoning and aesthetic choice, making language-conditioned virtual photography in arbitrary 3D scenes an executable agent task.
🔥 SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models
📅 Published on May 22
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23345
• PDF: https://arxiv.org/pdf/2605.23345
• Project Page: https://z2tong.github.io/SCOPE/
🤖 Models citing this paper:
• https://huggingface.co/zizhaotong/SCOPE
📊 Datasets citing this paper:
• https://huggingface.co/datasets/zizhaotong/CrossFPS-train
• https://huggingface.co/datasets/zizhaotong/CrossFPS-val
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#FirstPersonShooterGames #CrossGameOperations #PlayableEnvironments #VideoDiffusionModels #TransformerBlocks
💡 The paper introduces SCOPE, a method for simulating cross-game operations in playable environments for first-person shooter (FPS) games. The problem addressed is that existing interactive world models for FPS games struggle to handle high-frequency, overlapping control signals without disrupting unaffected regions. They inject actions globally and are trained on single game titles, an approach that fails under dense FPS inputs.
The proposed method conditions the transformer blocks of a video diffusion model so that in-scope visual effects are separated from out-of-scope ones without requiring segmentation labels. A conditioning module is inserted into each transformer block of a pre-trained video diffusion model and reshapes features into per-pixel temporal sequences. Each position can then compute its action response from local visual content, which separates in-scope effects from out-of-scope generation.
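The reshaping step can be illustrated with plain nested lists. This is a minimal sketch that assumes features are indexed as `features[t][y][x]`; in a tensor library the same operation would be a transpose and reshape from (T, H, W, C) to (H*W, T, C). The layout and function name are assumptions, not taken from the paper.

```python
def to_pixel_sequences(features):
    """Rearrange features[t][y][x] into one temporal sequence per spatial position."""
    num_frames = len(features)
    height = len(features[0])
    width = len(features[0][0])
    return [
        [features[t][y][x] for t in range(num_frames)]
        for y in range(height)
        for x in range(width)
    ]


# Toy input: 3 frames of a 2x2 feature map whose vectors record their own (t, y, x).
features = [[[[t, y, x] for x in range(2)] for y in range(2)] for t in range(3)]
sequences = to_pixel_sequences(features)
print(len(sequences), len(sequences[0]))
print(sequences[1])
```

Each of the four output sequences follows a single pixel through time, so a module applied to one sequence sees only that position's local content.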
The authors also introduce CrossFPS, a multi-game FPS dataset with frame-aligned action telemetry, comprising 69K clips from 7 titles with 10-degree-of-freedom controller signals. The dataset is curated to remove gameplay bias, so the model learns general visual-to-action mappings rather than game-specific patterns.
The results show that SCOPE enables strong action responsiveness, precise scope separation, and effective cross-game generalization. Because the model learns general visual-to-action mappings, it transfers zero-shot to unseen scenes and can be applied to new games without additional training data.
🔥 More Thought, Less Accuracy? On the Dual Nature of Reasoning in Vision-Language Models
📅 Published on Sep 30, 2025
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2509.25848
• PDF: https://arxiv.org/pdf/2509.25848
• Project Page: https://xytian1008.github.io/VAPO/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VisionLanguageModels #MultimodalReasoning #VisualForgetting #VisionAnchoredPolicyOptimization #PerceptualGrounding
💡 This paper examines reasoning in Vision-Language Models and identifies a dual nature of multimodal reasoning. While reasoning enhances logical inference and improves performance on complex tasks, it can also impair perceptual grounding, leading to recognition failures on basic visual questions. The authors attribute this phenomenon to visual forgetting, where prolonged reasoning causes the model to disregard visual input. To address this issue, they propose Vision-Anchored Policy Optimization (VAPO), a method that steers the reasoning process toward visually grounded trajectories. The resulting model, VAPO-Thinker-7B, relies significantly more on visual information and achieves state-of-the-art results on a range of benchmarks. The key contributions are the identification of the dual nature of multimodal reasoning and a method that balances reasoning with visual grounding.
🔥 TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction
📅 Published on May 25
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.26115
• PDF: https://arxiv.org/pdf/2605.26115
• Project Page: https://lhmd.top/trisplat/#interactive
🤖 Models citing this paper:
• https://huggingface.co/lhmd/TriSplat
📊 Datasets citing this paper:
• https://huggingface.co/datasets/lhmd/re10k_torch
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#3DSceneReconstruction #SimulationReadyMeshes #FeedForwardNetworks #TrianglePrimitives #ComputerVision
💡 The paper presents TriSplat, a feed-forward 3D reconstruction network that generates simulation-ready meshes from input images in a single forward pass. The problem addressed is that existing 3D reconstruction methods require expensive post-processing to extract a mesh usable for simulation or physics reasoning. Most of them use Gaussian primitives and do not directly expose surfaces, which makes a simulation-ready mesh difficult to obtain.
The method proposed in the paper uses oriented triangle primitives to represent scenes and directly exports simulation-ready mesh scenes from a single forward pass. The network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics from input images. The approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization.
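One common way to derive geometry normals from a point map is to take the cross product of the differences to neighboring points. The sketch below assumes `points[y][x]` holds an (X, Y, Z) tuple per pixel; it is a generic finite-difference construction, not necessarily the one used in the paper, and the sign of the normal depends on the image axis convention.

```python
import math


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def cross(a, b):
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def normalize(v):
    length = math.sqrt(v[0] ** 2 + v[1] ** 2 + v[2] ** 2)
    if length == 0:
        return (0.0, 0.0, 0.0)  # degenerate neighborhood, no defined normal
    return (v[0] / length, v[1] / length, v[2] / length)


def normals_from_point_map(points):
    """Return unit normals for the (H-1) x (W-1) grid of pixels with right and lower neighbors."""
    height, width = len(points), len(points[0])
    normals = []
    for y in range(height - 1):
        row = []
        for x in range(width - 1):
            du = sub(points[y][x + 1], points[y][x])  # step along image x
            dv = sub(points[y + 1][x], points[y][x])  # step along image y
            row.append(normalize(cross(du, dv)))
        normals.append(row)
    return normals


# Toy point maps: a flat plane z = 0 and a tilted plane z = x.
flat = [[(x, y, 0) for x in range(3)] for y in range(3)]
tilted = [[(x, y, x) for x in range(3)] for y in range(3)]
print(normals_from_point_map(flat)[0][0])
print(normals_from_point_map(tilted)[0][0])
```

Normals built this way are sensitive to noise in the predicted points, which may be why the summary mentions a learned refinement head applied afterward.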
The results show that the proposed representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. The network's output can be ingested directly by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction. Experiments were conducted on the RealEstate10K and DL3DV datasets. Overall, the paper contributes a 3D scene reconstruction method that bypasses expensive post-processing and directly generates simulation-ready meshes from input images.
🔥 WBench: A Comprehensive Multi-turn Benchmark for Interactive Video World Model Evaluation
📅 Published on May 25
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.25874
• PDF: https://arxiv.org/pdf/2605.25874
• Project Page: https://meituan-longcat.github.io/WBench/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#WorldModelEvaluation #InteractiveVideoBenchmarking #MultiturnDialogueSystems #VideoQualityAssessment #ArtificialIntelligenceForVideoAnalysis
💡 The paper introduces WBench, a comprehensive benchmark for evaluating interactive world models. The problem addressed is that existing benchmarks for interactive world models are limited and do not provide a unified standard for evaluation. To fill this gap, the authors created WBench, which evaluates models across five dimensions: video quality, setting adherence, interaction adherence, consistency, and physics compliance.
WBench comprises 289 test cases and 1058 interaction turns, covering diverse scenarios and interaction types, including navigation, subject action, event editing, and perspective switching. The benchmark unifies different input interfaces, such as text, 6-DoF pose, and discrete-action control, so that models with different native input interfaces can be evaluated together. The evaluation uses 22 automatic sub-metrics that combine specialist vision models with large multimodal models, and all metrics are validated against human judgments.
The results show that no single model performs strongly across all dimensions. The authors evaluated 20 state-of-the-art models using WBench and found that each model has characteristic strengths, weaknesses, and open challenges. The paper provides detailed diagnostic insights into the performance of each model, highlighting areas for improvement. The code and data for WBench are made available, allowing other researchers to use the benchmark to evaluate and improve their own interactive world models. Overall, the paper contributes to the development of interactive world models by providing a comprehensive and unified benchmark for evaluation.
🔥 ParaVT: Taming the Tool Prior Paradox for Parallel Tool Use in Agentic Video Reinforcement Learning
📅 Published on May 19
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.20342
• PDF: https://arxiv.org/pdf/2605.20342
• Project Page: https://evolvinglmms-lab.github.io/ParaVT/
🤖 Models citing this paper:
• https://huggingface.co/ParaVT/ParaVT-8B
📊 Datasets citing this paper:
• https://huggingface.co/datasets/ParaVT/ParaVT-Source
• https://huggingface.co/datasets/ParaVT/ParaVT-Parquet
🚀 Spaces citing this paper:
• https://huggingface.co/spaces/ParaVT/ParaVT
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#AgenticVideoReinforcementLearning #ParallelToolUse #MultiAgentReinforcementLearning #VideoToolCalling #ToolPriorParadox
💡 The paper introduces ParaVT, a multi-agent reinforcement learning framework for parallel video tool calling, in which multiple video-processing tools are used simultaneously. This approach addresses a limitation of existing sequential methods, where a single incorrect tool call can propagate errors and corrupt context. The authors identify a key challenge in applying standard reinforcement learning to ParaVT, which they call the Tool Prior Paradox: pretrained tool priors enable tool exploration but also destabilize the model's structural format and create a shortcut for skipping tools.
To address this issue, the authors propose PARA-GRPO, a modified reinforcement learning algorithm that incorporates two complementary mechanisms: a targeted format reward and a per-prompt frame-budget randomization. The targeted format reward helps to stabilize the model's structural format, while the frame-budget randomization encourages the model to use tools in a way that yields a measurable reward signal.
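The two mechanisms can be sketched on top of a GRPO-style advantage computation. This is an illustration under assumptions, not the PARA-GRPO algorithm: the reward weights, the set of frame budgets, and the boolean format check are all hypothetical, and only the within-group standardization follows the general GRPO recipe.

```python
import random
import statistics


def sample_frame_budget(rng, choices=(8, 16, 32, 64)):
    """Draw one frame budget per prompt (the values are illustrative)."""
    return rng.choice(choices)


def total_reward(correct: bool, well_formed: bool, format_weight: float = 0.5) -> float:
    """Task reward plus a bonus when the output follows the expected structure."""
    return float(correct) + (format_weight if well_formed else 0.0)


def group_relative_advantages(rewards):
    """Standardize rewards within one prompt's group of rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0 for _ in rewards]  # identical rewards carry no learning signal
    return [(r - mean) / std for r in rewards]


rng = random.Random(0)
budget = sample_frame_budget(rng)  # shared by every rollout of this prompt

# Hypothetical rollouts as (answer correct, format well formed) pairs.
rollouts = [(True, True), (True, False), (False, True), (False, False)]
rewards = [total_reward(c, f) for c, f in rollouts]
advantages = group_relative_advantages(rewards)
print(budget, rewards)
print([round(a, 3) for a in advantages])
```

The zero-variance branch shows why a group with identical rewards is uninformative: every advantage is zero, so nothing is learned from that prompt. Varying the frame budget across prompts is one way to make tool use change the outcome and restore a usable signal.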
The authors evaluate ParaVT with PARA-GRPO on six long-video understanding benchmarks and achieve an average improvement of 7.9% over the baseline Qwen3-VL model. Additionally, PARA-GRPO lifts training-time format compliance from 0.13 to 0.64, demonstrating the effectiveness of the proposed approach. The paper's contributions include a new framework for parallel video tool calling, a modified reinforcement learning algorithm, and a set of experimental results that demonstrate the benefits of the proposed approach. Overall, the paper provides a general recipe for agentic reinforcement learning that can be applied to a wide range of applications where tool capabilities are internalized in large multimodal models.