AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
🔥 PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

💡 The paper introduces PhysX-Omni, a unified framework for generating simulation-ready 3D assets with physical properties across multiple categories. The problem addressed is that existing 3D generation methods either neglect physical properties or are limited to a single asset category, such as rigid, deformable, or articulated objects. To address this, the authors develop a novel geometry representation tailored for vision-language models, which directly encodes high-resolution 3D structures without compression, significantly improving generation performance.

The PhysX-Omni framework generates simulation-ready physical 3D assets using this novel geometry representation. The authors also construct the first general simulation-ready 3D dataset, PhysXVerse, covering diverse indoor and outdoor categories. To evaluate the framework, they propose PhysX-Bench, a benchmark that encompasses six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description.

The results show that PhysX-Omni performs strongly in both generation and understanding, on conventional metrics as well as on PhysX-Bench. Additional studies validate the potential of PhysX-Omni for applications in simulation-ready scene generation and robotic policy learning. The authors believe that PhysX-Omni can significantly advance a wide range of downstream applications, particularly in embodied AI and physics-based simulation.

The key contributions of the paper are a novel geometry representation, the PhysXVerse dataset, and the PhysX-Bench benchmark. Together they enable the generation of simulation-ready physical 3D assets across rigid, deformable, and articulated categories, for use in applications such as robotics, computer vision, and simulation.


📅 Published on May 20

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.21572
• PDF: https://arxiv.org/pdf/2605.21572
• Project Page: https://physx-omni.github.io

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ComputerVision #3DModeling #PhysicsBasedSimulation #ArticulatedObjectSimulation #DeformableObjectModeling
🔥 GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

💡 The paper proposes a self-evolving image generation framework called GenEvolve that improves generative capabilities through iterative learning and reference-based prompting. The problem addressed is that high-quality image generation often requires combining a model's internal generative ability with external resources, and existing methods have limitations in handling diverse and demanding requests.

The GenEvolve framework models each generation attempt as a tool-orchestrated trajectory, where the agent gathers evidence, selects references, invokes generation skills, and composes them into a prompt-reference program. Unlike existing methods that rely on image-level scalar rewards, GenEvolve compares multiple trajectories for the same request and abstracts best-worst differences into structured visual experience.

This visual experience is provided to a privileged teacher branch, which uses visual experience distillation to provide dense token-level supervision to a student branch. This helps the student internalize better search, knowledge activation, reference selection, and prompt construction. The authors also construct GenEvolve-Data and GenEvolve-Bench to evaluate the framework.

The results show that GenEvolve achieves substantial gains over strong baselines and state-of-the-art performance among current image-generation frameworks, on both public benchmarks and GenEvolve-Bench. Overall, the paper contributes a self-evolving image generation framework aimed at diverse and demanding generation requests.


📅 Published on May 20

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.21605
• PDF: https://arxiv.org/pdf/2605.21605
• Project Page: https://ephemeral182.github.io/GenEvolve/

🤖 Models citing this paper:
https://huggingface.co/MeiGen-AI/GenEvolve

📊 Datasets citing this paper:
https://huggingface.co/datasets/MeiGen-AI/GenEvolve-Data-Bench

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ComputerVision #ImageGeneration #GenerativeModels #SelfEvolvingSystems #DeepLearning
🔥 LongCat-Video Technical Report

💡 The paper introduces LongCat-Video, a 13.6 billion parameter video generation model based on the Diffusion Transformer framework. The model is designed to generate high-quality long videos efficiently, which is a crucial step towards creating world models. LongCat-Video has a unified architecture that can perform multiple tasks, including text-to-video, image-to-video, and video continuation, using a single model.

The model achieves efficient long video generation through a coarse-to-fine generation strategy and block sparse attention, allowing it to generate 720p, 30fps videos within minutes. The coarse-to-fine generation strategy works by gradually increasing the resolution and detail of the video, both in terms of time and space. Block sparse attention is a technique that reduces the computational cost of the model by only attending to certain parts of the input data.
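As a rough illustration of the block sparse attention idea described above, the sketch below builds a block-level mask in which each query block attends only to nearby key blocks. The local-window pattern, the block size, and the function names are assumptions chosen for illustration; the post does not say how LongCat-Video selects which blocks to attend to.

```python
def block_sparse_mask(num_tokens, block_size, window):
    """Return a num_tokens x num_tokens boolean mask as a list of lists.

    mask[q][k] is True when query token q may attend to key token k.
    Hypothetical pattern: a query block sees key blocks whose index
    differs from its own by at most `window`.
    """
    mask = []
    for q in range(num_tokens):
        q_block = q // block_size
        row = [abs(q_block - (k // block_size)) <= window
               for k in range(num_tokens)]
        mask.append(row)
    return mask


def density(mask):
    """Fraction of query-key pairs that would actually be computed."""
    total = sum(len(row) for row in mask)
    kept = sum(sum(row) for row in mask)
    return kept / total
```

With 8 tokens, blocks of 2, and `window=0`, only the four diagonal blocks survive, so a quarter of the query-key pairs are computed. The saving grows as the sequence gets longer relative to the window.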

The model is trained with multi-reward reinforcement learning from human feedback (RLHF), which enables it to achieve performance comparable to state-of-the-art models.

The results show that LongCat-Video excels in generating high-quality long videos, maintaining temporal coherence and quality even in videos that are several minutes long. The model's efficiency and performance make it a significant contribution to the field of video generation, and the fact that the code and model weights are publicly available will accelerate progress in this area. Overall, LongCat-Video is a foundational model that takes an important step towards creating world models, which are complex models that can simulate and generate realistic videos and other types of data.


📅 Published on Oct 25, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2510.22200
• PDF: https://arxiv.org/pdf/2510.22200

🤖 Models citing this paper:
https://huggingface.co/meituan-longcat/LongCat-Video
https://huggingface.co/Nishant2414/LongCat-Video
https://huggingface.co/fjkane/LongCat-Video-bf16

🚀 Spaces citing this paper:
https://huggingface.co/spaces/cpuai/LongCat-Video-Avatar
https://huggingface.co/spaces/multimodalart/LongCat-Video
https://huggingface.co/spaces/armaishere/meituan-longcat-LongCat-Video

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VideoGenerationModels #DiffusionTransformer #LongVideoSynthesis #TextToVideoSynthesis #ImageToVideoGeneration
🔥 SimpleMem: Efficient Lifelong Memory for LLM Agents

💡 The paper introduces SimpleMem, an efficient lifelong memory framework for LLM agents. The problem addressed is the need for reliable long-term interaction in complex environments, which requires memory systems that efficiently manage historical experiences. Existing approaches either retain full interaction histories, leading to substantial redundancy, or rely on iterative reasoning to filter noise, incurring high token costs.

The proposed method, SimpleMem, is based on semantic lossless compression and consists of a three-stage pipeline designed to maximize information density and token utilization. The first stage, Semantic Structured Compression, applies entropy-aware filtering to distill unstructured interactions into compact, multi-view indexed memory units. The second stage, Recursive Memory Consolidation, is an asynchronous process that integrates related units into higher-level abstract representations to reduce redundancy. The third stage, Adaptive Query-Aware Retrieval, dynamically adjusts retrieval scope based on query complexity to construct precise context efficiently.
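The three stages above can be sketched as a toy pipeline. This is not the authors' implementation: word-level Shannon entropy stands in for entropy-aware filtering, exact-duplicate removal stands in for consolidation, and word overlap stands in for semantic retrieval. All function names and thresholds are hypothetical.

```python
import math
from collections import Counter


def token_entropy(text):
    """Shannon entropy (bits) of the word distribution in `text`."""
    words = text.lower().split()
    if not words:
        return 0.0
    counts = Counter(words)
    n = len(words)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def compress(turns, min_entropy=1.5):
    """Stage 1 stand-in: keep only turns whose entropy passes a threshold."""
    return [t for t in turns if token_entropy(t) >= min_entropy]


def consolidate(units):
    """Stage 2 stand-in: drop exact duplicates, keeping first-seen order."""
    return list(dict.fromkeys(units))


def retrieve(units, query, base_k=1):
    """Stage 3 stand-in: longer queries get a wider retrieval scope."""
    q_words = set(query.lower().split())
    k = base_k + len(q_words) // 4
    ranked = sorted(units,
                    key=lambda u: len(q_words & set(u.lower().split())),
                    reverse=True)
    return ranked[:k]
```

A call such as `retrieve(consolidate(compress(turns)), "Alice Paris")` returns a single memory unit, while a longer query widens `k`. That is the only sense in which this toy is "query-aware".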

The experiments on benchmark datasets show that SimpleMem consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost. The method achieves an average F1 improvement of 26.4% while reducing inference-time token consumption by up to 30-fold, demonstrating a superior balance between performance and efficiency. The code is available for further research and development. Overall, SimpleMem provides an efficient memory solution for LLM agents, enabling reliable long-term interaction in complex environments.


📅 Published on Jan 5

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2601.02553
• PDF: https://arxiv.org/pdf/2601.02553
• Project Page: https://aiming-lab.github.io/SimpleMem-Page/

📊 Datasets citing this paper:
https://huggingface.co/datasets/molmohsen/awesome-ai-agent-papers
https://huggingface.co/datasets/zhongweixie/A-Survey-on-AI-Agent-Harness

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LifelongLearningAlgorithms #EfficientMemoryFrameworks #LargeLanguageModelOptimization #SemanticCompressionTechniques #LifelongMemoryManagement
🔥 Cybersecurity AI: Humanoid Robots as Attack Vectors

💡 The paper presents a security assessment of the Unitree G1 humanoid robot, which is found to be vulnerable to exploits due to a critical command injection vulnerability in its BLE provisioning protocol. This vulnerability allows for root access via malformed Wi-Fi credentials, which can be exploited using hardcoded AES keys shared across all units. The researchers were able to partially reverse engineer the robot's proprietary FMX encryption, revealing a static Blowfish-ECB layer and a predictable LCG mask.

The study reveals two significant risks associated with the robot. Firstly, it can function as a trojan horse, continuously exfiltrating sensor and service-state telemetry to specific IP addresses without the operator's notice, violating GDPR regulations. Secondly, a resident Cybersecurity AI agent can pivot from reconnaissance to offensive preparation against any target, such as the manufacturer's cloud control plane, demonstrating the potential for escalation from passive monitoring to active counter-operations.

The researchers argue that these findings highlight the need for improved security standards in commercial robotics, particularly as humanoids move into critical infrastructure. The study contributes empirical evidence to shape future security standards for physical-cyber convergence systems, suggesting the need for adaptive Cybersecurity AI-powered defenses to mitigate these risks. The paper's contributions include the identification of critical vulnerabilities in the Unitree G1 humanoid robot, the demonstration of its potential as a covert surveillance node and active cyber operations platform, and the emphasis on the need for enhanced security measures to protect against such threats.


📅 Published on Sep 17, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2509.14139
• PDF: https://arxiv.org/pdf/2509.14139
• Project Page: https://aliasrobotics.com

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#CybersecurityVulnerabilities #HumanoidRobotExploits #BLEProtocolVulnerabilities #RoboticsSecurityRisks #ArtificialIntelligenceThreats
🔥 OpenGuardrails: An Open-Source Context-Aware AI Guardrails Platform

💡 OpenGuardrails is an open-source project that provides a unified model for detecting content safety and model manipulation risks in large language models. The project aims to address the critical issue of safeguarding large language models against unsafe, malicious, or privacy-violating content. The OpenGuardrails platform offers a comprehensive solution that includes a context-aware safety and manipulation detection model, as well as a separate named entity recognition pipeline for identifying and redacting sensitive data.

The platform protects against various types of risks, including content safety risks, model manipulation attacks such as prompt injection and jailbreaking, and data leakage. The content safety and model manipulation detection are implemented using a unified large model, while data leakage identification and redaction are performed using a separate lightweight named entity recognition pipeline.

The OpenGuardrails system can be deployed in various ways, including as a security gateway or an API-based service, with enterprise-grade options for fully private deployment. The project achieves state-of-the-art performance on safety benchmarks, excelling in both prompt and response classification across English, Chinese, and multilingual tasks.

The key contributions of the OpenGuardrails project include a unified model for content safety and model manipulation detection, a separate named entity recognition pipeline for data leakage identification and redaction, and state-of-the-art performance on safety benchmarks. The project also makes all models available under the Apache 2.0 license for public use, allowing for widespread adoption and further development of the technology. Overall, OpenGuardrails provides a comprehensive solution for safeguarding large language models against various types of risks, and its open-source nature makes it a valuable resource for the data science community.


📅 Published on Oct 22, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2510.19169
• PDF: https://arxiv.org/pdf/2510.19169
• Project Page: https://openguardrails.com

🤖 Models citing this paper:
https://huggingface.co/openguardrails/OpenGuardrails-Text-2510
https://huggingface.co/openguardrails/OpenGuardrails-Text-4B-0124

📊 Datasets citing this paper:
https://huggingface.co/datasets/openguardrails/OpenGuardrailsMixZh_97k
https://huggingface.co/datasets/qtqtqtqt/OpenGuardrailsMixZh_97k
https://huggingface.co/datasets/ruishen123/OpenGuardrailsMixZh_97k

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ContextAwareAI #LargeLanguageModels #ContentSafety #ModelManipulation #NamedEntityRecognition
🔥 GEPA: Reflective Prompt Evolution Can Outperform Reinforcement Learning

💡 The paper introduces GEPA, a prompt optimizer that uses natural language reflection to learn high-level rules from trial and error, outperforming reinforcement learning methods. The problem addressed is that current reinforcement learning methods, such as Group Relative Policy Optimization (GRPO), require thousands of rollouts to learn new tasks, which can be time-consuming and inefficient. The authors argue that the interpretable nature of language can provide a richer learning medium for large language models compared to policy gradients derived from sparse scalar rewards.

The method used is GEPA, a Genetic-Pareto prompt optimizer. GEPA samples system-level trajectories, reflects on them in natural language to diagnose problems, proposes and tests prompt updates, and combines complementary lessons from its own attempts. This approach allows GEPA to turn even a few rollouts into a large quality gain.
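The "Pareto" part of the name suggests keeping candidate prompts that are best on at least some tasks rather than only the single best on average. The post does not spell out the selection rule, so the sketch below is an assumption: it computes a plain Pareto front over per-task scores, with hypothetical prompt names and numbers.

```python
def dominates(a, b):
    """True if score vector a is at least as good as b on every task
    and strictly better on at least one."""
    return (all(x >= y for x, y in zip(a, b))
            and any(x > y for x, y in zip(a, b)))


def pareto_front(candidates):
    """candidates: dict mapping prompt name -> list of per-task scores.

    Returns the names of prompts that no other prompt dominates.
    """
    front = []
    for name, scores in candidates.items():
        dominated = any(dominates(other, scores)
                        for other_name, other in candidates.items()
                        if other_name != name)
        if not dominated:
            front.append(name)
    return front
```

Keeping the whole front, instead of a single winner, preserves prompts that encode complementary lessons, which can then be merged or mutated in later rounds.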

The results show that GEPA outperforms Group Relative Policy Optimization by 10 percent on average and by up to 20 percent, while using up to 35 times fewer rollouts. GEPA also outperforms the leading prompt optimizer, MIPROv2, by over 10 percent across two large language models. Additionally, GEPA demonstrates promising results as an inference-time search strategy for code optimization. Overall, the paper contributes a new approach to prompt optimization that can efficiently learn high-level rules from trial and error, outperforming current reinforcement learning methods.


📅 Published on Jul 25, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2507.19457
• PDF: https://arxiv.org/pdf/2507.19457
• Project Page: https://gepa-ai.github.io/gepa/

🤖 Models citing this paper:
https://huggingface.co/pirola/local-ai-coding-stack-research

📊 Datasets citing this paper:
https://huggingface.co/datasets/zhongweixie/A-Survey-on-AI-Agent-Harness

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#NaturalLanguageReflection #PromptOptimization #ReinforcementLearningAlternatives #GeneticParetoOptimization #LanguageModelLearning
🔥 SkillOpt: Executive Strategy for Self-Evolving Agent Skills

💡 The paper introduces SkillOpt, a systematic approach to optimizing agent skills with a text-space optimizer. Currently, agent skills are either hand-crafted, generated in one shot, or evolved through self-revision, which often results in unreliable improvements. SkillOpt addresses this issue by treating skills as trainable external state of a frozen agent, analogous to how deep learning optimizers update weights.

The method involves a separate optimizer model that takes scored rollouts and applies bounded edits to a single skill document, accepting edits only when they improve a held-out validation score. To ensure stability, SkillOpt uses a textual learning-rate budget, rejected-edit buffer, and epoch-wise slow updates, all of which add zero inference-time model calls at deployment.
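A minimal sketch of the accept-if-better loop described above, under stated assumptions: a character-count limit stands in for the textual learning-rate budget, and `propose_edit` and `validate` are hypothetical callables standing in for the optimizer model and the held-out evaluation. The epoch-wise slow updates are not modelled.

```python
def optimize_skill(skill, propose_edit, validate, epochs=3, max_edit_chars=40):
    """Greedy text-space loop: keep an edit only if validation improves."""
    best_score = validate(skill)
    rejected = []  # rejected-edit buffer, passed back to the proposer
    for _ in range(epochs):
        candidate = propose_edit(skill, rejected)
        # Bounded edit: crude stand-in for a textual learning-rate budget.
        if abs(len(candidate) - len(skill)) > max_edit_chars:
            rejected.append(candidate)
            continue
        score = validate(candidate)
        if score > best_score:
            skill, best_score = candidate, score
        else:
            rejected.append(candidate)
    return skill, best_score, rejected


# Toy usage with a scripted proposer and a keyword-counting validator.
def toy_validate(text):
    return sum(kw in text for kw in ("run tests", "check types"))


_edits = iter([" Always run tests.", " x" * 50, " Be nice."])


def toy_propose(skill, rejected):
    return skill + next(_edits)


final_skill, final_score, rejected_edits = optimize_skill(
    "Fix the bug.", toy_propose, toy_validate)
```

In this toy run the first edit is accepted, the second exceeds the edit budget, and the third does not raise the score, so the final skill is the original text plus one sentence. The loop runs entirely offline, which is consistent with the claim that it adds no model calls at deployment.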

The results show that SkillOpt outperforms existing methods across six benchmarks, seven target models, and three execution environments. It is best or tied for best on all 52 evaluated cells against every compared method, including hand-crafted (human) skills, one-shot LLM-generated skills, and other skill optimization methods. Notably, SkillOpt improves the average no-skill accuracy by 23.5 points on GPT-5.5 in direct chat, 24.8 points inside the Codex agentic loop, and 19.1 points inside Claude Code.

Furthermore, transfer experiments demonstrate that optimized skill artifacts retain their value when moved across model scales, between different execution environments, and to nearby benchmarks without further optimization. Overall, SkillOpt provides a systematic and controllable approach to optimize agent skills, resulting in superior performance and reliable improvements.


📅 Published on May 22

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23904
• PDF: https://arxiv.org/pdf/2605.23904
• Project Page: https://microsoft.github.io/SkillOpt/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#SelfEvolvingAgents #AgentSkillOptimization #TextSpaceOptimization #DeepLearningForAgents #ArtificialIntelligenceOptimization
🔥 Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

💡 The paper introduces Lens, a compact 3.8 billion parameter text-to-image model that achieves superior performance with reduced training compute. The problem addressed is the high computational cost of training large text-to-image models, which can be a significant barrier to their adoption. To address this, the authors propose two key strategies. First, they maximize data information density per training batch by using a dataset of 800 million densely captioned image-text pairs, where each caption contains approximately 109 words on average, providing richer semantic supervision than conventional short captions. They also construct each batch from images with multiple resolutions and diverse aspect ratios, enlarging the effective visual coverage of each optimization step.

Second, they improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. The authors also apply reinforcement learning with taxonomy-driven prompts and structured reward rubrics to suppress artifacts and improve visual quality, and use a reasoner module with training-free system prompt search to better align user requests with the model.

The results show that Lens achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6 billion parameters, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The model generalizes to arbitrary aspect ratios and resolutions up to 1440×1440, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024×1024 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds. Overall, the paper demonstrates that Lens is a highly efficient text-to-image model that can be trained with significantly fewer computational resources than existing models.


📅 Published on May 20

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.21573
• PDF: https://arxiv.org/pdf/2605.21573
• Project Page: https://huggingface.co/microsoft/Lens

🤖 Models citing this paper:
https://huggingface.co/microsoft/Lens-Turbo
https://huggingface.co/microsoft/Lens
https://huggingface.co/microsoft/Lens-Base

🚀 Spaces citing this paper:
https://huggingface.co/spaces/multimodalart/lens

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#TextToImageModels #EfficientTrainingMethods #CompactNeuralNetworks #ImageTextPairs #FoundationalModeling
🔥 PhotoFlow: Agentic 3D Virtual Photography Missions

💡 The paper introduces PhotoFlow, a Director-Reviewer-Reflector agent that enables language-conditioned virtual photography in arbitrary 3D scenes. The problem addressed is to create an agent that can enter a 3D scene, infer a suitable shot based on scene information and language intent, and render a photograph without preselected camera pose or reference image. This task requires complex 3D spatial understanding and abstract aesthetic judgment, which are difficult to evaluate together.

The method proposed is a closed-loop camera search using the Director-Reviewer-Reflector agent. The Director builds a photographic blueprint and proposes candidate cameras, the Reviewer checks and critiques the proposals, and the Reflector converts failures into region memory and adjusts the search. The authors also introduce VPhotoBench, a benchmark of 47 open-license 3D scenes and 141 language-conditioned photography missions.
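The closed-loop search above can be sketched as a small control loop. Everything here is a simplification: the candidate cameras are a fixed list, `score_fn` is a hypothetical stand-in for the Reviewer's critique, and region memory is just a set of region names. The default of six rounds mirrors the six-round rendering budget mentioned in the results.

```python
def photo_search(score_fn, candidates, rounds=6, threshold=0.8):
    """Propose a camera, review the shot, and remember failed regions."""
    failed_regions = set()           # Reflector's region memory
    best = (None, float("-inf"))
    for _ in range(rounds):
        # Director: propose the first candidate outside remembered failures.
        proposals = [c for c in candidates
                     if c["region"] not in failed_regions]
        if not proposals:
            break
        cam = proposals[0]
        # Reviewer: score the rendered shot.
        score = score_fn(cam)
        if score > best[1]:
            best = (cam, score)
        if score >= threshold:
            break
        # Reflector: convert the failure into region memory.
        failed_regions.add(cam["region"])
    return best, failed_regions
```

The point of the region memory is that one bad shot rules out its whole neighbourhood, so the limited rendering budget is not spent on near-duplicates of a failed view.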

The results show that PhotoFlow achieves the strongest external quality-alignment composite and success rate among various methods, including one-shot prediction, single-chain reflection, anchor-bank selection, and random search, under a six-round rendering budget. The paper demonstrates that a language model-centered spatial agent can produce strong photographs in a setting that challenges both 3D reasoning and aesthetic choice, making language-conditioned virtual photography in arbitrary 3D scenes an executable agent task.


📅 Published on May 22

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23771
• PDF: https://arxiv.org/pdf/2605.23771
• Project Page: https://visionary-laboratory.github.io/PhotoFlow/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VirtualPhotography #3DSceneUnderstanding #AgenticSystems #LanguageConditionedRendering #IntelligentCameraSystems
🔥 SCOPE: Simulating Cross-game Operations in Playable Environments for FPS World Models

💡 The paper introduces SCOPE, a method for simulating cross-game operations in playable environments for first-person shooter (FPS) games. The problem addressed is that existing methods for interactive world models in FPS games struggle to handle high-frequency, overlapping control signals without disrupting unaffected regions. This is because they inject actions globally and are trained on single game titles, an approach that fails under dense FPS inputs.

The proposed method conditions transformer blocks in video diffusion models to separate in-scope from out-of-scope visual effects without requiring segmentation labels. This is achieved by inserting a conditioning module into each transformer block of a pre-trained video diffusion model, which reshapes features into per-pixel temporal sequences. This allows each position to compute its action response from local visual content, effectively separating in-scope effects from out-of-scope generation.
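The reshaping step can be illustrated on nested lists. This is only a sketch of the data layout, assuming features indexed as `video[t][y][x]`; the conditioning module itself, and how it consumes each sequence, are not described in the post and are not modelled here.

```python
def to_pixel_sequences(video):
    """Rearrange video[t][y][x] into H*W sequences, each of length T.

    Sequence i holds the feature at one spatial position across all
    frames, so a per-position module can process it independently.
    """
    num_frames = len(video)
    height = len(video[0])
    width = len(video[0][0])
    return [[video[t][y][x] for t in range(num_frames)]
            for y in range(height)
            for x in range(width)]
```

In tensor terms this is the rearrangement from a `(T, H, W, C)` layout to `(H*W, T, C)`, so that time becomes the sequence axis and each pixel is its own batch element.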

The authors also introduce CrossFPS, a multi-game FPS dataset with frame-aligned action telemetry, comprising 69K clips from 7 titles with 10-degree-of-freedom controller signals. This dataset is curated to remove gameplay bias, allowing the model to learn general visual-to-action mappings rather than game-specific patterns.

The results show that the SCOPE method enables strong action responsiveness, precise scope separation, and effective cross-game generalization. The model learns general visual-to-action mappings, which enables zero-shot transfer to unseen scenes. This suggests the model can be applied to new games without additional training data, making it a notable contribution to interactive world models for FPS games.


📅 Published on May 22

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.23345
• PDF: https://arxiv.org/pdf/2605.23345
• Project Page: https://z2tong.github.io/SCOPE/

🤖 Models citing this paper:
https://huggingface.co/zizhaotong/SCOPE

📊 Datasets citing this paper:
https://huggingface.co/datasets/zizhaotong/CrossFPS-train
https://huggingface.co/datasets/zizhaotong/CrossFPS-val

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#FirstPersonShooterGames #CrossGameOperations #PlayableEnvironments #VideoDiffusionModels #TransformerBlocks