AI & ML Papers

The AI community building the future. Hugging Face has 438 repositories available. Follow their code on GitHub.

🔥 GenClaw: Code-Driven Agentic Image Generation

💡 The paper introduces GenClaw, a code-driven agentic image generation framework that enables precise visual construction through a staged process. The problem with existing image generation models is that they are black-box systems that rely on text-conditioned pixel synthesis, leaving them with no direct mechanism to manipulate the canvas. This leads to a repetitive cycle of prompt rewriting for generation refinement, limiting their potential for precise visual construction.

The GenClaw method addresses this issue by empowering the agent to create like a human artist, through three stages: conceptualization, sketching, and coloring. In the conceptualization stage, the agent constructs conceptual knowledge and context through search and reasoning. The agent then utilizes code, such as SVG or HTML, to render executable visual sketches in the sketching stage. Finally, it employs an image generation model to supplement textures, materials, and photorealism in the coloring stage.
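The sketching stage can be pictured with a minimal Python sketch, assuming the agent emits a scene as a list of shape records that are rendered to SVG. The `Shape` record, the `render_svg` helper, and the toy scene are hypothetical illustrations, not GenClaw's actual interface; the conceptualization and coloring stages are left out.

```python
from dataclasses import dataclass
from xml.sax.saxutils import quoteattr


@dataclass
class Shape:
    """Hypothetical shape record an agent might emit while sketching."""
    kind: str   # "rect" or "circle"
    x: int      # left edge (rect) or center x (circle)
    y: int      # top edge (rect) or center y (circle)
    size: int   # side length (rect) or radius (circle)
    fill: str   # any SVG color string


def render_svg(shapes, width=200, height=200):
    """Render shape records into an SVG document string."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    ]
    for s in shapes:
        fill = quoteattr(s.fill)  # escapes and quotes the attribute value
        if s.kind == "rect":
            parts.append(
                f'<rect x="{s.x}" y="{s.y}" width="{s.size}" '
                f'height="{s.size}" fill={fill}/>'
            )
        elif s.kind == "circle":
            parts.append(
                f'<circle cx="{s.x}" cy="{s.y}" r="{s.size}" fill={fill}/>'
            )
        else:
            raise ValueError(f"unsupported shape kind: {s.kind}")
    parts.append("</svg>")
    return "\n".join(parts)


# Toy scene: every element has explicit coordinates that can be edited directly.
scene = [
    Shape("rect", 20, 120, 60, "brown"),
    Shape("circle", 150, 40, 25, "gold"),
]
svg = render_svg(scene)
```

Because the layout lives in code, moving the circle means changing one coordinate in `scene` rather than rewriting a prompt.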

By using code as a controllable intermediate canvas, GenClaw bridges linguistic reasoning and pixel synthesis, combining programmatic logic with the visual expressiveness of generative models. This turns image generation from a black-box paradigm into a staged process in which the canvas can be manipulated directly, offering a step toward highly controllable and interpretable visual generation systems.

📅 Published on May 28

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30248
• PDF: https://arxiv.org/pdf/2605.30248

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticImageGeneration #CodeDrivenArt #StagedImageConstruction #VisualConstructionTechniques #ImageGenerationFrameworks

🔥 VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild

💡 The paper introduces VibeSearchBench, a benchmark for evaluating long-horizon proactive search in real-world scenarios. The problem addressed is the poor performance of large language model-based agents in search tasks that involve multi-turn dialogue and collaborative refinement of user intent. Existing benchmarks rely on over-specified queries, single-turn interactions, and fixed-schema evaluation, which do not reflect real search behavior.

To address this issue, the authors propose VibeSearch, a paradigm that involves multi-turn dialogue and collaborative refinement of vague user intent. The VibeSearchBench benchmark consists of 200 manually curated bilingual tasks across 20 domains, split into professional and daily-life subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework.

The authors benchmark seven frontier models under two different frameworks and find that all of them perform poorly: the best F1 score is 30.30. This highlights the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction. The paper's contributions are the VibeSearch paradigm, the VibeSearchBench benchmark, and an evaluation of state-of-the-art models on it, which reveals a significant gap between current models and real-world search requirements.
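To make the F1 figure concrete, here is a minimal Python sketch of scoring a predicted knowledge graph against a ground-truth graph. It assumes graphs are sets of (subject, relation, object) triples compared by exact match; the benchmark's graph-matching framework is schema-free and likely more lenient than this. The `triple_f1` function and the toy triples are made up for illustration.

```python
def triple_f1(predicted, gold):
    """F1 between two sets of (subject, relation, object) triples.

    Assumption: triples match only when all three fields are identical.
    """
    pred, ref = set(predicted), set(gold)
    true_positives = len(pred & ref)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(ref)
    return 2 * precision * recall / (precision + recall)


# Toy ground truth and agent output (illustrative data only).
gold = {
    ("Kyoto", "has_attraction", "Fushimi Inari"),
    ("Kyoto", "best_season", "autumn"),
    ("Fushimi Inari", "entry_fee", "free"),
}
pred = {
    ("Kyoto", "has_attraction", "Fushimi Inari"),
    ("Kyoto", "best_season", "spring"),
}

score = triple_f1(pred, gold)  # precision 1/2, recall 1/3
```

Here one of two predicted triples is correct and one of three gold triples is recovered, so the score is 0.4, or 40 on the 0 to 100 scale used above.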

📅 Published on May 27

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.27882
• PDF: https://arxiv.org/pdf/2605.27882
• Project Page: https://vibebench.github.io/VibeSearchBench.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ProactiveSearch #LongHorizonSearch #MultiTurnDialogue #CollaborativeSearch #NaturalLanguageSearch

🔥 SkillNet: Create, Evaluate, and Connect AI Skills

💡 The paper introduces SkillNet, an open infrastructure designed to systematically accumulate and transfer artificial intelligence skills across multiple domains. The problem addressed is that current AI agents lack a unified mechanism for skill consolidation, resulting in redundant efforts and limited long-term advancement. To overcome this limitation, SkillNet structures skills within a unified ontology that supports creating skills from diverse sources, establishing connections, and evaluating skills across multiple dimensions such as safety, completeness, and cost awareness.

The SkillNet infrastructure consists of a repository of over 200,000 skills, an interactive platform, and a Python toolkit. This infrastructure enables the creation, evaluation, and organization of AI skills at scale. By formalizing skills as evolving and composable assets, SkillNet provides a robust foundation for agents to move from transient experience to durable mastery.

The results of the paper demonstrate the effectiveness of SkillNet in enhancing agent performance. Experimental evaluations on various environments such as ALFWorld, WebShop, and ScienceWorld show that SkillNet significantly improves average rewards by 40 percent and reduces execution steps by 30 percent across multiple backbone models. Overall, the paper contributes to the development of a unified infrastructure for AI skill accumulation and transfer, which has the potential to accelerate the advancement of AI agents across multiple domains.

📅 Published on Feb 26

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2603.04448
• PDF: https://arxiv.org/pdf/2603.04448
• Project Page: http://skillnet.openkg.cn/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ArtificialIntelligenceSkills #AIInfrastructureDevelopment #SkillOntology #ArtificialGeneralIntelligence #TransferLearningMechanisms

🔥 OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

💡 The paper proposes a new method called OSCAR for ultra-low-bit key-value cache quantization, which is crucial for efficient deployment of large language models. The problem addressed is that existing quantization methods, such as simple rotations like Hadamard transforms, degrade in accuracy when applied to very low-bit representations, like 2-bit integers. This degradation occurs because these methods do not account for the attention-aware covariance structures that the model actually uses.

To solve this problem, OSCAR estimates the attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. This approach aligns the key-value cache quantization with the covariance structures that the model consumes, leading to higher accuracy and efficiency.
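The offline-rotation idea can be sketched in NumPy under simplifying assumptions. This sketch uses the plain covariance of calibration keys, not the attention-aware covariance the paper estimates, and picks clipping thresholds from a 99th percentile, which is an arbitrary choice made here. The names `fit_rotation`, `quantize_2bit`, and `dequantize` are hypothetical.

```python
import numpy as np


def fit_rotation(calib):
    """Offline step: eigenvectors of the calibration covariance.

    The returned matrix is orthogonal, so it can be undone with its transpose.
    """
    cov = np.cov(calib, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)
    return eigvecs


def quantize_2bit(x, clip):
    """Uniform quantization to 4 levels (codes 0..3) inside [-clip, clip]."""
    step = 2 * clip / 3  # 4 levels leave 3 intervals
    codes = np.round((np.clip(x, -clip, clip) + clip) / step).astype(np.uint8)
    return codes, step


def dequantize(codes, step, clip):
    """Map 2-bit codes back to approximate real values."""
    return codes.astype(np.float32) * step - clip


rng = np.random.default_rng(0)
mix = rng.normal(size=(8, 8))                  # induces correlated channels
calib = rng.normal(size=(1024, 8)) @ mix       # stand-in calibration keys
keys = rng.normal(size=(16, 8)) @ mix          # stand-in keys seen at inference

# Offline: fixed rotation and per-channel clipping thresholds.
rotation = fit_rotation(calib)
clip = np.percentile(np.abs(calib @ rotation), 99, axis=0)

# Online: rotate, quantize to 2 bits, then dequantize and rotate back.
codes, step = quantize_2bit(keys @ rotation, clip)
recon = dequantize(codes, step, clip) @ rotation.T
mse = float(np.mean((recon - keys) ** 2))
```

Both the rotation and the thresholds are computed once from calibration data and then held fixed, which is what makes the scheme an offline one.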

The authors provide theoretical justification for OSCAR and develop a fully deployable system that is compatible with modern large language model serving frameworks. They evaluate OSCAR on several reasoning models with long context lengths, up to 32,000 tokens, and achieve significant improvements in accuracy compared to naive rotation methods. Specifically, OSCAR reduces the accuracy gap to 3.78 and 1.42 points on two models, while naive rotation methods collapse to nearly zero.

The results also show that OSCAR scales well to larger models, remaining effectively on par with higher-precision representations. Additionally, OSCAR achieves significant system-wise improvements, including reducing key-value cache memory by approximately 8 times, improving throughput by up to 7 times, and accelerating batch-size-1 decoding by up to 3 times over higher-precision representations. Overall, the paper demonstrates that OSCAR is an effective and efficient method for ultra-low-bit key-value cache quantization, enabling the deployment of large language models with high accuracy and efficiency.

📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.17757
• PDF: https://arxiv.org/pdf/2605.17757
• Project Page: https://oscar-quantize.github.io/

🤖 Models citing this paper:
• https://huggingface.co/Zhongzhu/OSCAR-RotationZoo

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#QuantizationMethods #LowBitRepresentations #KeyvalueCache #SpectralCovariance #EfficientDeployment

🔥 VLM3: Vision Language Models Are Native 3D Learners

💡 The paper "VLM3: Vision Language Models Are Native 3D Learners" challenges the common approach to 3D understanding tasks in computer vision. Typically, these tasks rely on specialized vision models with complex designs and extensive data augmentation. However, the authors argue that vision language models can be adapted for 3D understanding tasks through simple architectural modifications and text-based training.

The problem addressed in this paper is that 3D understanding tasks such as depth estimation and object-level 3D understanding are currently dominated by expert vision models that have complex task-specific designs. The authors propose that vision language models can be native 3D learners and achieve comparable performance to these specialized models.

The method used in this study involves making three simple modifications to standard vision language models. These modifications include focal length unification, text-based pixel reference, and data mixture and scaling. The authors propose VLM3, a scalable method that enables standard vision language models to master diverse 3D tasks without requiring complex designs or extensive data augmentation.

The results of the study show that VLM3 advances the depth estimation accuracy of vision language models by a large margin, from 0.84 to 0.9. Additionally, VLM3 enables diverse 3D tasks such as pixel correspondence, camera pose estimation, and object-level 3D understanding, matching the accuracy of expert vision models while maintaining standard architectures and text-based training. Overall, the paper presents a new paradigm for simple and scalable 3D learning, demonstrating that vision language models can be effective native 3D learners.

📅 Published on May 28

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30561
• PDF: https://arxiv.org/pdf/2605.30561

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #3DUnderstanding #DepthEstimation #ObjectLevel3D #ComputerVisionModels

🔥 GGT-100K: Generative Ground Truth for Generalizable Real-World Image Restoration

💡 The paper addresses the problem of real-world image restoration being limited by the lack of high-quality paired training data. Existing synthetic datasets often fail to model real-world degradations, while capturing real-world paired datasets is expensive and difficult. To overcome this, the authors propose using generative multimodal foundation models to produce high-quality targets from real-world low-quality images, referred to as Generative Ground Truth.

The authors systematically evaluate nine state-of-the-art models and find that one model, Nano-Banana-2 with adaptive prompting, is particularly effective at synthesizing realistic and content-faithful high-quality targets. They then use this model to build a dataset, GGT-100K, which consists of over 103,000 low-quality and high-quality paired images covering diverse scenes and real-world degradations.

The results show that using GGT-100K as a training dataset consistently improves the real-world generalization of a wide range of image restoration models, particularly when fine-tuning generative models. The authors conclude that their approach can serve as a practical tool for generating high-quality training data for image restoration tasks, and that GGT-100K is a useful resource for expanding the generalization capabilities of real-world image restoration models.

📅 Published on May 29

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.31039
• PDF: https://arxiv.org/pdf/2605.31039
• Project Page: https://polyu-vclab.github.io/GGT-100K/

📊 Datasets citing this paper:
• https://huggingface.co/datasets/VCLab-PolyU/GGT-100K

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ImageRestoration #GenerativeGroundTruth #RealWorldDegradations #MultimodalFoundationModels #GenerativeMultimodalLearning

🔥 Scaling Agents via Continual Pre-training

💡 The paper addresses the issue of large language models underperforming in agentic tasks despite being capable of autonomous tool use and multi-step reasoning. The root cause is identified as the lack of robust agentic foundation models, which forces models to learn diverse agentic behaviors and align them to expert demonstrations simultaneously during post-training, resulting in optimization tensions.

To overcome this, the authors propose incorporating Agentic Continual Pre-training into the training pipeline to build powerful agentic foundation models. They develop a deep research agent model called AgentFounder based on this approach.

AgentFounder is evaluated on 10 benchmarks and achieves state-of-the-art performance while retaining strong tool-use ability, with notable results including 39.9 percent on BrowseComp-en, 43.3 percent on BrowseComp-zh, and 31.5 percent Pass@1 on HLE. The contributions of the paper are the introduction of Agentic Continual Pre-training and the AgentFounder model, which demonstrates the effectiveness of this approach in building robust agentic foundation models.

📅 Published on Sep 16, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2509.13310
• PDF: https://arxiv.org/pdf/2509.13310
• Project Page: https://tongyi-agent.github.io/blog/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticFoundationModels #ContinualPretraining #AutonomousToolUse #MultiStepReasoning #AgenticBehaviorLearning

🔥 WebShaper: Agentically Data Synthesizing via Information-Seeking Formalization

💡 The paper introduces WebShaper, a framework that synthesizes information-seeking datasets to improve the performance of artificial intelligence agents. The problem addressed is the scarcity of high-quality training data for information-seeking tasks, which are complex and open-ended. Existing approaches typically collect web data and then generate questions, but this can lead to inconsistencies between the information structure and the reasoning structure of the questions and answers.

To solve this problem, WebShaper uses a formalization-driven approach based on set theory and Knowledge Projections. This approach enables precise control over the reasoning structure of the synthesized data. The framework starts by creating seed tasks and then expands them into more complex questions using a multi-step process. The expansion process involves an agentic Expander that uses retrieval and validation tools to ensure the quality of the synthesized data.
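A rough feel for the set-theoretic view can be had from a short Python sketch. It assumes a Knowledge Projection can be read as "the set of entities linked by a given relation to some entity in a target set", and that a question is built by combining such sets with intersection and union. That reading, the `knowledge_projection` helper, and the toy triples are assumptions made here, not the paper's formal definitions.

```python
def knowledge_projection(triples, relation, targets):
    """Return every subject linked by `relation` to an entity in `targets`."""
    return {s for (s, r, o) in triples if r == relation and o in targets}


# Toy knowledge base of (subject, relation, object) triples.
triples = {
    ("alice", "plays_for", "club_a"),
    ("bob", "plays_for", "club_a"),
    ("carol", "plays_for", "club_b"),
    ("alice", "born_in", "1990"),
    ("carol", "born_in", "1990"),
    ("bob", "born_in", "1992"),
}

# "Who played for club_a AND was born in 1990?" becomes an intersection.
played = knowledge_projection(triples, "plays_for", {"club_a"})
born = knowledge_projection(triples, "born_in", {"1990"})
answer = played & born

# Widening the target set acts as a union: "born in 1990 OR 1992".
born_either = knowledge_projection(triples, "born_in", {"1990", "1992"})
```

Writing the question as explicit set operations is what lets its reasoning structure be controlled: each added projection or intersection is one more hop the answering agent has to resolve.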

The key contribution of WebShaper is its ability to systematically formalize information-seeking tasks and synthesize high-quality datasets. The framework is evaluated on two open-sourced benchmarks, GAIA and WebWalkerQA, and achieves state-of-the-art performance. The results demonstrate that WebShaper is effective in synthesizing datasets that can train information-seeking agents to achieve top performance. Overall, WebShaper provides a novel solution to the problem of data scarcity in information-seeking tasks and has the potential to improve the performance of artificial intelligence agents in complex and open-ended tasks.

📅 Published on Jul 20, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• Project Page: https://huggingface.co/papers?q=Knowledge%20Projections%20(KP)
• arXiv: https://arxiv.org/abs/2507.15061
• PDF: https://arxiv.org/pdf/2507.15061

🤖 Models citing this paper:
• https://huggingface.co/Alibaba-NLP/WebShaper-32B

📊 Datasets citing this paper:
• https://huggingface.co/datasets/Alibaba-NLP/WebShaper
• https://huggingface.co/datasets/JingmingChen/PathRefiner

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ArtificialIntelligenceAgents #InformationSeekingTasks #DataSynthesisTechniques #KnowledgeProjections #FormalizationDrivenApproaches
