AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
🔥 MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research

💡 The paper introduces MobileGym, a browser-based mobile environment designed for mobile GUI agent research. The main problem addressed is the lack of a verifiable and highly parallel simulation platform for training and evaluating mobile GUI agents. Traditional methods are limited by their inability to provide deterministic outcome signals and scalable reinforcement learning.

The authors propose MobileGym as a solution, which enables deterministic evaluation and scalable reinforcement learning through JSON-based state management and parallel execution. The platform captures the full environment state as structured JSON, allowing for easy configuration, forking, and comparison of states. This approach enables a single server to host hundreds of parallel instances, with low memory requirements and fast startup times.
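
As a rough illustration of why a JSON-serializable state makes forking and comparison cheap, the sketch below deep-copies a snapshot and reports which paths changed. It is not code from the paper: the snapshot layout and the helper names `fork_state` and `diff_states` are hypothetical, since the summary does not describe MobileGym's actual schema.

```python
import copy
import json


def fork_state(state):
    """Return an independent deep copy so a forked instance can diverge."""
    return copy.deepcopy(state)


def diff_states(before, after, path=""):
    """Return {path: (old, new)} for differing values.

    Dicts are compared key by key; lists and scalars are compared as whole values.
    A key missing on one side is reported with None on that side.
    """
    if isinstance(before, dict) and isinstance(after, dict):
        changes = {}
        for key in sorted(set(before) | set(after)):
            child = f"{path}/{key}"
            if key not in before:
                changes[child] = (None, after[key])
            elif key not in after:
                changes[child] = (before[key], None)
            else:
                changes.update(diff_states(before[key], after[key], child))
        return changes
    return {} if before == after else {path: (before, after)}


# Hypothetical snapshot; the real MobileGym schema is not given in the summary.
base = json.loads(
    '{"apps": {"notes": {"items": ["buy milk"]}, "settings": {"wifi": true}}}'
)
fork = fork_state(base)
fork["apps"]["settings"]["wifi"] = False
fork["apps"]["notes"]["items"].append("call Sam")

changes = diff_states(base, fork)
for changed_path, (old, new) in changes.items():
    print(changed_path, old, "->", new)
```

Because the fork is a deep copy, the base snapshot is untouched and can seed further forks.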

MobileGym features a layered state model and a declarative task-definition framework, making it practical to create and program tasks at scale. The platform also includes a single programmatic judging mechanism that delivers both deterministic evaluation verdicts and dense RL rewards. To facilitate research, the authors provide MobileGym-Bench, a collection of 416 parameterized task templates across 28 apps, including 256 test and 160 train templates.
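
One way to picture a single judge that yields both a pass/fail verdict and a dense reward is sketched below. This is an assumption-laden toy, not MobileGym's framework: `TaskTemplate`, `judge`, the alarm example, and the "fraction of checks passed" reward are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

State = Dict[str, object]


@dataclass
class TaskTemplate:
    instruction: str                              # text with {placeholders}
    checks: List[Callable[[State, dict], bool]]   # predicates over the final state

    def instantiate(self, **params):
        """Fill the placeholders and return (instruction text, params)."""
        return self.instruction.format(**params), params


def judge(template, params, state):
    """Return (verdict, reward).

    The verdict requires every check to pass; the reward is the passed fraction
    (an assumed reward shape, chosen only to show a dense signal).
    """
    passed = [check(state, params) for check in template.checks]
    return all(passed), sum(passed) / len(passed)


# Hypothetical template: one parameterized instruction, two state predicates.
template = TaskTemplate(
    instruction="Set an alarm for {time} labeled {label}",
    checks=[
        lambda s, p: any(a["time"] == p["time"] for a in s["alarms"]),
        lambda s, p: any(
            a["time"] == p["time"] and a["label"] == p["label"] for a in s["alarms"]
        ),
    ],
)

text, params = template.instantiate(time="07:30", label="gym")
state = {"alarms": [{"time": "07:30", "label": ""}]}  # agent forgot the label
verdict, reward = judge(template, params, state)
print(text, verdict, reward)
```

Here the agent set the right time but not the label, so the verdict is a fail while the reward still credits the partial progress.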

The results demonstrate the effectiveness of MobileGym in a Sim-to-Real case study, where a model trained in the simulation environment achieves a 12.8 percentage point gain on a 256-task test set. When executed on real devices, the model retains 95.1% of the simulation-side training gain, indicating the potential of MobileGym for real-world applications. Overall, MobileGym provides a verifiable and highly parallel simulation platform for mobile GUI agent research, enabling scalable reinforcement learning and deterministic evaluation.


📅 Published on May 25

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.26114
• PDF: https://arxiv.org/pdf/2605.26114
• Project Page: https://mobilegym.github.io

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MobileGUIAgents #ParallelSimulation #ReinforcementLearning #MobileEnvironmentSimulation #GUIAgentResearch
🔥 GenClaw: Code-Driven Agentic Image Generation

💡 The paper introduces GenClaw, a code-driven agentic image generation framework that enables precise visual construction through a staged process. The problem with existing image generation models is that they are black-box systems that rely on text-conditioned pixel synthesis, leaving them with no direct mechanism to manipulate the canvas. This leads to a repetitive cycle of prompt rewriting for generation refinement, limiting their potential for precise visual construction.

The GenClaw method addresses this issue by empowering the agent to create like a human artist, through three stages: conceptualization, sketching, and coloring. In the conceptualization stage, the agent constructs conceptual knowledge and context through search and reasoning. The agent then utilizes code, such as SVG or HTML, to render executable visual sketches in the sketching stage. Finally, it employs an image generation model to supplement textures, materials, and photorealism in the coloring stage.
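
To make the sketching stage concrete, here is a minimal example of emitting a layout as SVG from code. It is only a sketch under assumptions: the `render_sketch` helper and the scene are hypothetical, and the summary does not say how GenClaw itself structures or renders its sketches.

```python
import xml.etree.ElementTree as ET


def render_sketch(width, height, shapes):
    """Build an SVG document string from a list of (tag, attributes) pairs."""
    root = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": str(width),
        "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })
    for tag, attrs in shapes:
        ET.SubElement(root, tag, {k: str(v) for k, v in attrs.items()})
    return ET.tostring(root, encoding="unicode")


# Hypothetical scene: ground, sun, and a triangular hill.
shapes = [
    ("rect", {"x": 0, "y": 120, "width": 200, "height": 80, "fill": "#6b8e23"}),
    ("circle", {"cx": 160, "cy": 40, "r": 20, "fill": "#ffd700"}),
    ("polygon", {"points": "40,120 80,60 120,120", "fill": "#8b4513"}),
]
svg = render_sketch(200, 200, shapes)
print(svg)
```

Every coordinate is an explicit number, so moving the sun means editing one attribute rather than rewriting a prompt.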

By using code as a controllable intermediate canvas, GenClaw bridges linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. This approach transforms image generation from a black-box paradigm into a staged process, offering a step toward highly controllable and interpretable visual generation systems. The results of GenClaw demonstrate a more precise and interpretable image generation process, allowing for direct manipulation of the canvas and overcoming the limitations of existing black-box image models.


📅 Published on May 28

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30248
• PDF: https://arxiv.org/pdf/2605.30248

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticImageGeneration #CodeDrivenArt #StagedImageConstruction #VisualConstructionTechniques #ImageGenerationFrameworks
🔥 VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

💡 The paper introduces VibeSearchBench, a benchmark for evaluating long-horizon proactive search in real-world scenarios. The problem addressed is the poor performance of large language model-based agents in search tasks that involve multi-turn dialogue and collaborative refinement of user intent. Existing benchmarks rely on over-specified queries, single-turn interactions, and fixed-schema evaluation, which do not reflect real search behavior.

To address this issue, the authors propose VibeSearch, a paradigm that involves multi-turn dialogue and collaborative refinement of vague user intent. The VibeSearchBench benchmark consists of 200 manually curated bilingual tasks across 20 domains, split into professional and daily-life subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework.
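
For intuition about scoring a predicted knowledge graph against a ground-truth one, the sketch below computes precision, recall, and F1 over exact-match triples. This is a simplification assumed for illustration: the benchmark's graph-matching framework is schema-free and is likely more lenient than exact matching, and the triples here are invented.

```python
def triple_f1(predicted, gold):
    """Precision, recall and F1 over exact-match (subject, relation, object) triples."""
    predicted, gold = set(predicted), set(gold)
    matched = len(predicted & gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Hypothetical ground truth and agent output.
gold = [
    ("laptop_a", "weight_kg", "1.2"),
    ("laptop_a", "battery_h", "14"),
    ("laptop_b", "weight_kg", "1.6"),
    ("laptop_b", "battery_h", "10"),
]
predicted = [
    ("laptop_a", "weight_kg", "1.2"),
    ("laptop_b", "battery_h", "10"),
    ("laptop_b", "weight_kg", "2.0"),   # wrong value, counts as a miss
]

precision, recall, f1 = triple_f1(predicted, gold)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```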

The authors benchmark seven frontier models under two different frameworks and find that all models perform poorly, with the best F1 score being 30.30. This highlights the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction. The paper's contributions include the introduction of the VibeSearch paradigm, the creation of the VibeSearchBench benchmark, and the evaluation of state-of-the-art models in this new benchmark, which reveals the significant gap between current models and real-world search requirements.


📅 Published on May 27

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.27882
• PDF: https://arxiv.org/pdf/2605.27882
• Project Page: https://vibebench.github.io/VibeSearchBench.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ProactiveSearch #LongHorizonSearch #MultiTurnDialogue #CollaborativeSearch #NaturalLanguageSearch
🔥 SkillNet: Create, Evaluate, and Connect AI Skills

💡 The paper introduces SkillNet, an open infrastructure designed to systematically accumulate and transfer artificial intelligence skills across multiple domains. The problem addressed is that current AI agents lack a unified mechanism for skill consolidation, resulting in redundant efforts and limited long-term advancement. To overcome this limitation, SkillNet structures skills within a unified ontology that supports creating skills from diverse sources, establishing connections, and evaluating skills across multiple dimensions such as safety, completeness, and cost awareness.

The SkillNet infrastructure consists of a repository of over 200,000 skills, an interactive platform, and a Python toolkit. This infrastructure enables the creation, evaluation, and organization of AI skills at scale. By formalizing skills as evolving and composable assets, SkillNet provides a robust foundation for agents to move from transient experience to durable mastery.

The results of the paper demonstrate the effectiveness of SkillNet in enhancing agent performance. Experimental evaluations on various environments such as ALFWorld, WebShop, and ScienceWorld show that SkillNet significantly improves average rewards by 40 percent and reduces execution steps by 30 percent across multiple backbone models. Overall, the paper contributes to the development of a unified infrastructure for AI skill accumulation and transfer, which has the potential to accelerate the advancement of AI agents across multiple domains.


📅 Published on Feb 26

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2603.04448
• PDF: https://arxiv.org/pdf/2603.04448
• Project Page: http://skillnet.openkg.cn/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ArtificialIntelligenceSkills #AIInfrastructureDevelopment #SkillOntology #ArtificialGeneralIntelligence #TransferLearningMechanisms
🔥 OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

💡 The paper proposes OSCAR, a method for ultra-low-bit key-value (KV) cache quantization, which matters for efficient deployment of large language models. Existing approaches that apply a simple fixed rotation, such as a Hadamard transform, lose accuracy at very low bit widths such as 2-bit integers. This degradation occurs because these methods do not account for the attention-aware covariance structures that the model actually uses.

To solve this problem, OSCAR estimates the attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. This approach aligns the key-value cache quantization with the covariance structures that the model consumes, leading to higher accuracy and efficiency.
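
The general pattern of "estimate a rotation and clipping thresholds offline, then quantize with them fixed" can be shown with a 2-D toy. This is not OSCAR's algorithm: it uses the plain sample covariance of a few invented calibration vectors rather than an attention-aware covariance, and the per-axis max-abs clipping rule and all function names are assumptions made for this sketch.

```python
import math


def principal_angle(samples):
    """Angle of the leading eigenvector of the 2x2 sample covariance."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / n
    syy = sum((y - my) ** 2 for _, y in samples) / n
    sxy = sum((x - mx) * (y - my) for x, y in samples) / n
    return 0.5 * math.atan2(2 * sxy, sxx - syy)


def rotate(vec, theta):
    """Apply a 2-D rotation, the smallest example of an orthogonal transform."""
    x, y = vec
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)


def quantize_2bit(value, clip):
    """Clip to [-clip, clip] and map to one of 4 integer codes (0..3)."""
    clipped = max(-clip, min(clip, value))
    step = 2 * clip / 3          # 4 levels span the range in 3 steps
    return round((clipped + clip) / step)


def dequantize_2bit(code, clip):
    """Map a 2-bit code back to its level in [-clip, clip]."""
    return -clip + code * (2 * clip / 3)


# Offline step on hypothetical calibration vectors with correlated coordinates.
calibration = [(1.0, 0.9), (-1.0, -1.1), (2.0, 2.1), (-2.0, -1.9)]
theta = principal_angle(calibration)
rotated = [rotate(v, -theta) for v in calibration]
clips = [max(abs(r[i]) for r in rotated) for i in range(2)]


# Online step: the rotation and thresholds stay fixed.
def compress(vec):
    r = rotate(vec, -theta)
    return [quantize_2bit(r[i], clips[i]) for i in range(2)]


def decompress(codes):
    r = tuple(dequantize_2bit(codes[i], clips[i]) for i in range(2))
    return rotate(r, theta)


codes = compress((2.0, 2.1))
restored = decompress(codes)
print(codes, restored)
```

After the rotation, most of the variance sits on the first axis, so the second axis gets a much tighter clipping range and its four levels are spaced far more finely.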

The authors provide theoretical justification for OSCAR and develop a fully deployable system that is compatible with modern large language model serving frameworks. They evaluate OSCAR on several reasoning models with long context lengths, up to 32,000 tokens, and achieve significant improvements in accuracy compared to naive rotation methods. Specifically, OSCAR reduces the accuracy gap to 3.78 and 1.42 points on two models, while naive rotation methods collapse to nearly zero.

The results also show that OSCAR scales well to larger models, remaining effectively on par with higher-precision representations. Additionally, OSCAR achieves significant system-wise improvements, including reducing key-value cache memory by approximately 8 times, improving throughput by up to 7 times, and accelerating batch-size-1 decoding by up to 3 times over higher-precision representations. Overall, the paper demonstrates that OSCAR is an effective and efficient method for ultra-low-bit key-value cache quantization, enabling the deployment of large language models with high accuracy and efficiency.
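
The roughly 8x memory figure is consistent with simple arithmetic if the higher-precision baseline is 16-bit, which is an assumption here since the summary does not name the baseline. The model shape below is hypothetical, and the calculation ignores any per-group scale or zero-point metadata a real quantized cache would store.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bits):
    """Bytes for keys and values together, ignoring quantization metadata."""
    elements = 2 * layers * kv_heads * head_dim * tokens   # 2 = keys + values
    return elements * bits / 8


# Hypothetical model shape at a 32,000-token context.
shape = dict(layers=32, kv_heads=8, head_dim=128, tokens=32_000)
baseline = kv_cache_bytes(bits=16, **shape)    # assumed 16-bit baseline
quantized = kv_cache_bytes(bits=2, **shape)

print(f"16-bit: {baseline / 2**30:.2f} GiB")
print(f" 2-bit: {quantized / 2**30:.2f} GiB")
print(f"ratio : {baseline / quantized:.1f}x")
```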


📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.17757
• PDF: https://arxiv.org/pdf/2605.17757
• Project Page: https://oscar-quantize.github.io/

🤖 Models citing this paper:
https://huggingface.co/Zhongzhu/OSCAR-RotationZoo

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#QuantizationMethods #LowBitRepresentations #KeyvalueCache #SpectralCovariance #EfficientDeployment
🔥 VLM3: Vision Language Models Are Native 3D Learners

💡 The paper "VLM3: Vision Language Models Are Native 3D Learners" presents a study that challenges the common approach to 3D understanding tasks in computer vision. Typically, these tasks rely on specialized vision models with complex designs and extensive data augmentation. However, the authors argue that vision language models can be adapted for 3D understanding tasks through simple architectural modifications and text-based training.

The problem addressed in this paper is that 3D understanding tasks such as depth estimation and object-level 3D understanding are currently dominated by expert vision models that have complex task-specific designs. The authors propose that vision language models can be native 3D learners and achieve comparable performance to these specialized models.

The method makes three simple modifications to standard vision language models: focal length unification, text-based pixel reference, and data mixture and scaling. The resulting approach, VLM3, is a scalable method that enables standard vision language models to master diverse 3D tasks without requiring complex designs or extensive data augmentation.

The results of the study show that VLM3 advances the depth estimation accuracy of vision language models by a large margin, from 0.84 to 0.9. Additionally, VLM3 enables diverse 3D tasks such as pixel correspondence, camera pose estimation, and object-level 3D understanding, matching the accuracy of expert vision models while maintaining standard architectures and text-based training. Overall, the paper presents a new paradigm for simple and scalable 3D learning, demonstrating that vision language models can be effective native 3D learners.


📅 Published on May 28

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30561
• PDF: https://arxiv.org/pdf/2605.30561

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #3DUnderstanding #DepthEstimation #ObjectLevel3D #ComputerVisionModels
🔥 GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

💡 The paper addresses the problem of real-world image restoration being limited by the lack of high-quality paired training data. Existing synthetic datasets often fail to model real-world degradations, while capturing real-world paired datasets is expensive and difficult. To overcome this, the authors propose using generative multimodal foundation models to produce high-quality targets from real-world low-quality images, referred to as Generative Ground Truth.

The authors systematically evaluate nine state-of-the-art models and find that one model, Nano-Banana-2 with adaptive prompting, is particularly effective at synthesizing realistic and content-faithful high-quality targets. They then use this model to build a dataset, GGT-100K, which consists of over 103,000 low-quality and high-quality paired images covering diverse scenes and real-world degradations.

The results show that using GGT-100K as a training dataset consistently improves the real-world generalization of a wide range of image restoration models, particularly when fine-tuning generative models. The authors conclude that their approach can serve as a practical tool for generating high-quality training data for image restoration tasks, and that GGT-100K is a useful resource for expanding the generalization capabilities of real-world image restoration models.


📅 Published on May 29

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.31039
• PDF: https://arxiv.org/pdf/2605.31039
• Project Page: https://polyu-vclab.github.io/GGT-100K/

📊 Datasets citing this paper:
https://huggingface.co/datasets/VCLab-PolyU/GGT-100K

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ImageRestoration #GenerativeGroundTruth #RealWorldDegradations #MultimodalFoundationModels #GenerativeMultimodalLearning
🔥 Scaling Agents via Continual Pre-training

💡 The paper addresses the issue of large language models underperforming in agentic tasks despite being capable of autonomous tool use and multi-step reasoning. The root cause of this underperformance is identified as the lack of robust agentic foundation models, which forces models to learn diverse agentic behaviors and align them to expert demonstrations simultaneously during post-training, resulting in optimization tensions.

To overcome this, the authors propose incorporating Agentic Continual Pre-training into the training pipeline to build powerful agentic foundation models. They develop a deep research agent model called AgentFounder based on this approach.

AgentFounder is evaluated on 10 benchmarks and achieves state-of-the-art performance while retaining strong tool-use ability, with notable results including 39.9% on BrowseComp-en, 43.3% on BrowseComp-zh, and 31.5% Pass@1 on HLE. The contributions of the paper include the introduction of Agentic Continual Pre-training and the development of the AgentFounder model, which demonstrates the effectiveness of this approach in building robust agentic foundation models.


📅 Published on Sep 16, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2509.13310
• PDF: https://arxiv.org/pdf/2509.13310
• Project Page: https://tongyi-agent.github.io/blog/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticFoundationModels #ContinualPretraining #AutonomousToolUse #MultiStepReasoning #AgenticBehaviorLearning
🔥 WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

💡 The paper introduces WebShaper, a framework that synthesizes information-seeking datasets to improve the performance of artificial intelligence agents. The problem addressed is the scarcity of high-quality training data for information-seeking tasks, which are complex and open-ended. Existing approaches typically collect web data and then generate questions, but this can lead to inconsistencies between the information structure and the reasoning structure of the questions and answers.

To solve this problem, WebShaper uses a formalization-driven approach based on set theory and Knowledge Projections. This approach enables precise control over the reasoning structure of the synthesized data. The framework starts by creating seed tasks and then expands them into more complex questions using a multi-step process. The expansion process involves an agentic Expander that uses retrieval and validation tools to ensure the quality of the synthesized data.
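
A set-based reading of this formalization can be sketched in a few lines. The assumption here, not confirmed by the summary, is that a Knowledge Projection maps a relation and a set of target entities to the set of entities related to any of them, and that a question is then a composition of projections with set operations. The helper name and the toy facts are invented for illustration.

```python
def knowledge_projection(relation, targets):
    """Entities u such that (u, v) is in the relation for some v in targets."""
    return {u for (u, v) in relation if v in targets}


# Hypothetical toy knowledge base, stored as sets of (subject, object) pairs.
plays_for = {("Ana", "FC North"), ("Ben", "FC North"), ("Cy", "FC South")}
founded_in = {("FC North", 1966), ("FC South", 1970)}
born_in = {("Ana", 1994), ("Ben", 1988), ("Cy", 1996)}

# "Which player of a club founded in 1966 was born in the 1990s?"
clubs_1966 = knowledge_projection(founded_in, {1966})
players_of_those_clubs = knowledge_projection(plays_for, clubs_1966)
born_in_90s = knowledge_projection(born_in, set(range(1990, 2000)))

answer = players_of_those_clubs & born_in_90s
print(answer)
```

Writing the question as nested projections plus an intersection makes its reasoning structure explicit, which is the kind of control the formalization is meant to give over synthesized questions.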

The key contribution of WebShaper is its ability to systematically formalize information-seeking tasks and synthesize high-quality datasets. The framework is evaluated on two open-sourced benchmarks, GAIA and WebWalkerQA, and achieves state-of-the-art performance. The results demonstrate that WebShaper is effective in synthesizing datasets that can train information-seeking agents to achieve top performance. Overall, WebShaper provides a novel solution to the problem of data scarcity in information-seeking tasks and has the potential to improve the performance of artificial intelligence agents in complex and open-ended tasks.


📅 Published on Jul 20, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• Project Page: https://huggingface.co/papers?q=Knowledge%20Projections%20(KP)
• arXiv: https://arxiv.org/abs/2507.15061
• PDF: https://arxiv.org/pdf/2507.15061

🤖 Models citing this paper:
https://huggingface.co/Alibaba-NLP/WebShaper-32B

📊 Datasets citing this paper:
https://huggingface.co/datasets/Alibaba-NLP/WebShaper
https://huggingface.co/datasets/JingmingChen/PathRefiner

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ArtificialIntelligenceAgents #InformationSeekingTasks #DataSynthesisTechniques #KnowledgeProjections #FormalizationDrivenApproaches
🔥 WebWatcher: Breaking New Frontier of Vision-Language Deep Research Agent

💡 The paper introduces WebWatcher, a multimodal agent designed to improve visual-language reasoning in deep research tasks. The problem addressed is that most existing research agents are text-centric and overlook visual information, making multimodal deep research challenging. To solve this, WebWatcher is equipped with enhanced visual-language reasoning capabilities, leveraging synthetic multimodal trajectories for efficient training, utilizing various tools for deep reasoning, and enhancing generalization through reinforcement learning.

The method involves using high-quality synthetic multimodal trajectories for cold start training, which allows the agent to learn from both visual and textual information. The agent is also designed to work with various tools to improve its reasoning abilities. Additionally, the paper proposes a new benchmark called BrowseComp-VL, which is used to evaluate the capabilities of multimodal agents in complex information retrieval tasks involving both visual and textual information.

The results show that WebWatcher significantly outperforms existing baseline agents, including proprietary and open-source agents, in four challenging visual question answering benchmarks. This demonstrates the effectiveness of WebWatcher in solving complex multimodal information-seeking tasks and paves the way for further research in this area. Overall, the paper contributes to the development of multimodal agents with stronger reasoning abilities, which can handle both visual and textual information, and provides a new benchmark for evaluating the performance of such agents.


📅 Published on Aug 7, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2508.05748
• PDF: https://arxiv.org/pdf/2508.05748
• Project Page: https://tongyi-agent.github.io/blog/introducing-tongyi-deep-research/

🤖 Models citing this paper:
https://huggingface.co/Alibaba-NLP/WebWatcher-32B
https://huggingface.co/Alibaba-NLP/WebWatcher-7B

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalLearning #VisionLanguageReasoning #DeepResearchAgents #SyntheticMultimodalTrajectories #ReinforcementLearningForVision