AI & ML Papers
🔥 MOSS-TTS Technical Report
📅 Published on Mar 18
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2603.18090
• PDF: https://arxiv.org/pdf/2603.18090
• Project Page: https://mosi.cn/models/moss-tts
🤖 Models citing this paper:
• https://huggingface.co/OpenMOSS-Team/MOSS-TTS
• https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Nano-100M
• https://huggingface.co/OpenMOSS-Team/MOSS-TTS-Realtime
📊 Datasets citing this paper:
• https://huggingface.co/datasets/somu9/mls_eng_tokens
🚀 Spaces citing this paper:
• https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-v1.5
• https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS-Nano
• https://huggingface.co/spaces/OpenMOSS-Team/MOSS-TTS
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#SpeechGeneration #VoiceCloning #AutoregressiveModeling #DiscreteAudioTokens #TransformerTokenizer
💡 The MOSS-TTS technical report presents a speech generation model that uses discrete audio tokens and autoregressive modeling for voice cloning, pronunciation control, and long-form generation across multiple languages. Its recipe starts from a causal Transformer tokenizer that compresses 24 kHz audio to 12.5 fps with variable-bitrate RVQ and unified semantic-acoustic representations. The report releases two complementary generators: MOSS-TTS, which emphasizes structural simplicity, scalability, and long-context, control-oriented deployment; and MOSS-TTS-Local-Transformer, which adds a frame-local autoregressive module for higher modeling efficiency, stronger speaker preservation, and a shorter time to first audio.
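To make the tokenizer figures concrete, here is a small arithmetic sketch of what 24 kHz audio at 12.5 frames per second implies. The sample rate and frame rate come from the summary; the `num_codebooks` parameter is a hypothetical stand-in for however many RVQ layers are active, which the summary does not state.

```python
import math

SAMPLE_RATE_HZ = 24_000  # from the summary
FRAME_RATE_FPS = 12.5    # from the summary


def samples_per_frame(sample_rate_hz: float = SAMPLE_RATE_HZ,
                      frame_rate_fps: float = FRAME_RATE_FPS) -> float:
    """Number of raw audio samples summarized by one token frame."""
    return sample_rate_hz / frame_rate_fps


def frames_for_duration(seconds: float,
                        frame_rate_fps: float = FRAME_RATE_FPS) -> int:
    """Token frames needed to cover a clip, rounding up a partial frame."""
    return math.ceil(seconds * frame_rate_fps)


def tokens_for_duration(seconds: float, num_codebooks: int) -> int:
    """Total discrete tokens if every frame carries `num_codebooks` codes.

    `num_codebooks` is an assumption for illustration; with variable-bitrate
    RVQ the number of active layers can differ.
    """
    return frames_for_duration(seconds) * num_codebooks


if __name__ == "__main__":
    print(samples_per_frame())         # samples per frame
    print(frames_for_duration(60))     # frames in one minute of audio
    print(tokens_for_duration(10, 8))  # tokens for 10 s with 8 assumed codebooks
```

Each frame therefore covers 1,920 samples (80 ms), so one minute of audio is 750 frames, which is part of what makes long-form autoregressive generation tractable.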
The report targets speech generation in multilingual, open-domain settings, where one model must support voice cloning, pronunciation control, and long-form generation. Its approach combines discrete audio tokens, autoregressive modeling, and large-scale pretraining.
According to the report, MOSS-TTS supports zero-shot voice cloning, token-level duration control, phoneme-/pinyin-level pronunciation control, smooth code-switching, and stable long-form generation across multilingual and open-domain settings. The report also summarizes the design, training recipe, and empirical characteristics of the released models.
❤3
Forwarded from Machine Learning
🔖 A huge open-source course on AI Engineering from scratch
In the repository, we've collected:
— 435 lessons;
— 320+ hours of content;
— Python, TypeScript, and Rust;
— AI agents, MCP servers, prompts, and AI skills.
Moreover, almost every lesson includes practical tasks, so this isn't just theory, but a full-fledged roadmap for AI Engineering. 🚀
⛓️ Link to the repository
https://github.com/rohitg00/ai-engineering-from-scratch
#AI #MachineLearning #Python #Rust #OpenSource #Tech
✨ Join Best TG Channels https://xn--r1a.website/addlist/0f6vfFbEMdAwODBk
⭐️ Join Our WhatsApp Channel https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
❤3
AI & ML Papers
🔥 OSP-Next: Efficient High-Quality Video Generation with Sparse Sequence Parallelism, HiF8 Quantization, and Reinforcement Learning
📅 Published on May 27
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.28691
• PDF: https://arxiv.org/pdf/2605.28691
🤖 Models citing this paper:
• https://huggingface.co/yunyangge/OSP-Next
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VideoGenerationModels #SparseSequenceParallelism #HiF8Quantization #ReinforcementLearningForVideo #TextToVideoSynthesis
💡 The paper introduces OSP-Next, an efficient text-to-video generation model that targets the high computational cost of existing models. Diffusion Transformers achieve strong video generation quality, but full attention makes their cost quadratic in sequence length. To reduce this cost, OSP-Next combines sparse attention, parallelism, quantization, and reinforcement learning.
The method used in OSP-Next is a hybrid full-sparse attention architecture, where the sparse component is implemented with Skiparse-2D Attention. This mechanism applies token-wise and group-wise sparse attention along spatial dimensions, leveraging locality while maintaining compatibility with FlashAttention kernels. The authors also propose Sparse Sequence Parallelism, which partitions subsequences across ranks and switches sparse patterns through a single All-to-All communication. This approach reduces communication volume by 75% compared to Ulysses Sequence Parallelism.
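As a rough illustration of why sparsity helps, the sketch below counts query-key pairs for full attention and for a generic strided grouping in which each token attends only within its own interleaved group. This grouping is an assumption chosen for simplicity, not the Skiparse-2D pattern itself, and `volume_after_reduction` simply applies the 75% figure quoted above to a hypothetical baseline.

```python
def full_attention_pairs(n_tokens: int) -> int:
    """Every token attends to every token: quadratic in sequence length."""
    return n_tokens * n_tokens


def strided_groups(n_tokens: int, k: int) -> list[list[int]]:
    """Split token indices into k interleaved groups (0, k, 2k, ...), (1, k+1, ...)."""
    return [list(range(offset, n_tokens, k)) for offset in range(k)]


def sparse_attention_pairs(n_tokens: int, k: int) -> int:
    """Pairs computed when attention is restricted to within each group."""
    return sum(len(group) ** 2 for group in strided_groups(n_tokens, k))


def volume_after_reduction(baseline_volume: float, reduction: float) -> float:
    """Remaining communication volume after a fractional reduction."""
    return baseline_volume * (1.0 - reduction)


if __name__ == "__main__":
    n, k = 16, 4
    print(full_attention_pairs(n), sparse_attention_pairs(n, k))
    print(volume_after_reduction(100.0, 0.75))
```

With 16 tokens and 4 groups, the pair count drops from 256 to 64, a factor of k; a 75% cut in communication leaves a quarter of the baseline volume.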
Additionally, OSP-Next incorporates HiF8 quantization to enable stable joint training with 8-bit quantization and sparse fine-tuning. The model also applies Mix-GRPO post-training to improve the performance of the sparse model. The authors evaluate OSP-Next on various settings, including 5-second 720P and 5-second 768P, and achieve significant speedups on NVIDIA H200 GPUs and Ascend 950PR hardware.
The results show that OSP-Next achieves a VBench total score of 83.73%, surpassing the Wan2.1 baseline, with up to 1.64× single-GPU speedup and over 1.52× eight-GPU speedup on NVIDIA H200 GPUs. With only a 0.4% drop in VBench total score, OSP-Next-HiF8 achieves 1.69× and 2.27× speedups under the two settings on a single Ascend 950PR, which indicates that the efficiency gains carry across hardware platforms.
❤1
AI & ML Papers
🔥 DenoiseRL: Bootstrapping Reasoning Models to Recover from Noisy Prefixes
📅 Published on May 27
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.28421
• PDF: https://arxiv.org/pdf/2605.28421
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#DenoiseRL #ReinforcementLearningForNLP #NoisyPrefixRecovery #ReasoningModelOptimization #LargeLanguageModelImprovement
💡 The paper introduces DenoiseRL, a reinforcement learning framework that aims to improve reasoning in large language models by learning from incorrect reasoning traces. The problem with existing methods is that they rely heavily on stronger teacher models or carefully curated datasets, which limits their scalability and capability to improve. DenoiseRL addresses this issue by substituting external supervision with recovery-oriented optimization over failures from weak models. This approach allows the model to learn directly from incorrect reasoning traces, converting them into opportunities for improvement and making training more scalable and less dependent on external resources.
The method used in DenoiseRL involves failure-oriented optimization, where the model learns from its own mistakes and recovers from noisy prefixes. This approach yields a richer and more diverse learning signal, improving exploration efficiency from imperfect model behavior. As a result, DenoiseRL improves reasoning performance and overall training efficiency while reducing the need for expensive data curation or stronger teacher models.
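The summary does not say how noisy prefixes are built or scored, so the following is only a minimal sketch of one plausible setup: truncate a failed reasoning trace at several cut points, use each truncation as a prompt prefix the model must recover from, and score the continuation by whether the final answer is correct. All names here (`RecoveryExample`, `build_recovery_prompts`, `outcome_reward`) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RecoveryExample:
    prompt: str        # question followed by a truncated, possibly wrong trace
    prefix_steps: int  # how many steps of the failed trace were kept


def build_recovery_prompts(question: str,
                           failed_steps: list[str],
                           cut_points: list[int]) -> list[RecoveryExample]:
    """Turn one failed trace into several 'continue from here' prompts."""
    examples = []
    for cut in cut_points:
        if not 0 < cut <= len(failed_steps):
            continue  # skip cut points that fall outside the trace
        prefix = "\n".join(failed_steps[:cut])
        examples.append(RecoveryExample(prompt=f"{question}\n{prefix}\n",
                                        prefix_steps=cut))
    return examples


def outcome_reward(final_answer: str, reference: str) -> float:
    """Assumed binary reward: 1.0 if the recovered answer matches the reference."""
    return 1.0 if final_answer.strip() == reference.strip() else 0.0


if __name__ == "__main__":
    trace = ["First add 2+3=5.", "Then 5*4=20."]
    for ex in build_recovery_prompts("Q: 2+3*4?", trace, [1, 2, 5]):
        print(ex.prefix_steps, repr(ex.prompt))
```

In this framing, a single failure yields several training prompts, which is one way a failed trace could supply the richer learning signal the summary describes.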
The paper reports that DenoiseRL consistently outperforms strong on-policy RL baselines across competitive mathematical and general reasoning benchmarks. It also promotes stronger self-corrective behavior as training difficulty increases, pointing to an effective and scalable alternative pathway for improving reasoning in large language models.
❤1
AI & ML Papers
🔥 stable-worldmodel-v1: Reproducible World Modeling Research and Evaluation
📅 Published on Feb 9
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2602.08968
• PDF: https://arxiv.org/pdf/2602.08968
• Project Page: https://galilai-group.github.io/stable-worldmodel/
🤖 Models citing this paper:
• https://huggingface.co/zzsi/swm-dmc-cheetah
• https://huggingface.co/zzsi/swm-dmc-expert-policies
📊 Datasets citing this paper:
• https://huggingface.co/datasets/zzsi/swm-dmc-expert
• https://huggingface.co/datasets/zzsi/swm-dmc-mixed-small
• https://huggingface.co/datasets/zzsi/swm-dmc-mixed-large
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#WorldModeling #ReinforcementLearning #ArtificialIntelligence #RoboticsResearch #EnvironmentModeling
💡 The paper introduces stable-worldmodel, a modular and standardized research framework for developing and evaluating world models. World models are a powerful tool for learning compact representations of environment dynamics, enabling agents to reason and generalize beyond direct experience. However, current implementations are often publication-specific, which limits their reusability, increases the risk of bugs, and reduces evaluation standardization.
To address this issue, the authors developed stable-worldmodel, a tested and documented research ecosystem that provides efficient data collection tools, standardized environments, planning algorithms, and baseline implementations. The framework allows for controllable environmental factors, including visual and physical properties, to support robustness and continual learning research.
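For readers unfamiliar with planning on top of a world model, here is a toy random-shooting planner. It does not use the stable-worldmodel API, which the summary does not describe; the one-dimensional dynamics, the cost function, and every name below are assumptions made for illustration.

```python
import random


def toy_world_model(state: float, action: float) -> float:
    """Hypothetical 1-D dynamics: the action shifts the state."""
    return state + action


def rollout_cost(state: float, actions: list[float], goal: float) -> float:
    """Imagine the action sequence with the model and measure final distance to goal."""
    for action in actions:
        state = toy_world_model(state, action)
    return abs(goal - state)


def random_shooting_plan(state: float, goal: float, horizon: int = 5,
                         n_candidates: int = 64, seed: int = 0) -> list[float]:
    """Sample action sequences and keep the one the model predicts is best."""
    rng = random.Random(seed)
    # Include the do-nothing plan so the result is never worse than staying put.
    candidates = [[0.0] * horizon]
    candidates += [[rng.uniform(-1.0, 1.0) for _ in range(horizon)]
                   for _ in range(n_candidates)]
    return min(candidates, key=lambda acts: rollout_cost(state, acts, goal))


if __name__ == "__main__":
    plan = random_shooting_plan(state=0.0, goal=2.0)
    print(plan, rollout_cost(0.0, plan, 2.0))
```

The planner never touches the real environment while choosing actions; it only queries the model, which is why model quality and standardized evaluation matter.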
The authors demonstrate the framework by using it to study zero-shot robustness in DINO-WM. The paper's main contributions are a modular, standardized research framework for world models, efficient data collection tools and standardized environments, and this zero-shot robustness case study.
The stated goal is reliable, reproducible world-modeling research: researchers can focus on developing and evaluating new world models rather than reimplementing and debugging existing ones.
❤1👍1
Forwarded from Data Analytics
📰 Anthropic is rolling out Claude Opus 4.8 🚀
The model has become significantly more honest in evaluating its own work and notices problems in its own code four times more often. 🔍✨
Plus, dynamic workflows have been added: hundreds of AI subagents can work on large projects and migrations in parallel. 🤖⚡
⛓️ More details here
https://www.anthropic.com/news/claude-opus-4-8
#Anthropic #ClaudeOpus48 #AI #ArtificialIntelligence #TechNews #Innovation
✨ Join Best TG Channels https://xn--r1a.website/addlist/0f6vfFbEMdAwODBk
⭐️ Join Our WhatsApp Channel https://whatsapp.com/channel/0029VaC7Weq29753hpcggW2A
❤2👍1
🚀 HelloEncyclo Presale is LIVE!
Master the skills that matter — Gen-AI, Data Science, Machine Learning and more — all in one place.
🎁 First 250 members get a flat 40% OFF
Use code: PRESALE-BOOK-WAVE-2GFG
✅ 13 full courses live right now
✅ 40+ more dropping in the next 2–3 weeks
✅ Complete library within 2 months — built and refined by industry experts
✅ 15-day money-back guarantee — don't love it? Get a full refund.
⚠️ Coupon works only after you log in with Gmail, and it's valid once per member.
👉 Log in now and start learning:
https://helloencyclo.com
Don't wait — the 40% deal disappears after the first 250 seats. 🔥
🔥 minWM: A Full-Stack Open-Source Framework for Real-Time Interactive Video World Models
📅 Published on May 28
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30263
• PDF: https://arxiv.org/pdf/2605.30263
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VideoDiffusionModels #RealTimeInteractiveSystems #VideoWorldModels #BidirectionalVideoGeneration #InteractiveVideoFrameworks
💡 The paper presents a comprehensive framework called minWM for converting bidirectional video diffusion models into real-time interactive video world models. The problem addressed is that recent video diffusion foundation models have achieved high-quality video generation but turning them into real-time interactive world models remains challenging due to the need for controllable, causal, and low-latency capabilities.
The method used in minWM is a full-stack open-source framework that provides an end-to-end pipeline to convert existing bidirectional video foundation models into camera-controllable few-step autoregressive world models. This is achieved through fine-tuning and distillation techniques, including causal forcing, causal consistency distillation, and asymmetric DMD. The framework is modular and architecture-extensible, allowing it to be instantiated on different open backbones and adapted to new data distributions, training recipes, and latency targets.
The result is a camera-controllable, real-time interactive video world model with low-latency rollout and high-quality video generation. The framework is released with runnable scripts, checkpoints, documentation, and inference code, along with practical ablations on camera trajectory quality, controllability training steps, and minimal batch-size requirements, giving a reproducible and extensible recipe for building and adapting such models.
AI & ML Papers
🔥 LoMo: Local Modality Substitution for Deeper Vision-Language Fusion
📅 Published on May 28
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30265
• PDF: https://arxiv.org/pdf/2605.30265
• Project Page: https://maplebb.github.io/LoMo/page/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#VisionLanguageModels #ModalitySubstitution #CrossModalLearning #MultimodalFusion #DeepLearningArchitectures
💡 The paper addresses the issue of modality sensitivity in vision-language models, which occurs when a model's performance degrades significantly when the modality of the input is changed, such as replacing a textual question with its rendered-image counterpart. This problem arises due to the inherent bias in current training corpora, where text and images are typically organized into distinct and asymmetric roles.
To address this issue, the authors propose Local Modality Substitution, a data curation approach that provides supervision for cross-modal representational invariance between semantically equivalent text and image carriers. This method reformulates single-modality prompts into seamlessly interleaved multimodal sequences by dynamically selecting target text spans and recasting them as rendered images, thereby preserving the same semantics across different carriers.
The authors evaluate their approach on 13 diverse multimodal benchmarks and demonstrate that it significantly improves overall multimodal reasoning and yields deeper cross-modal fusion, achieving consistent gains across foundational models. Specifically, the approach delivers improvements of 2.67 points on one model and 2.82 points on another, compared to standard methods. The proposed method is lightweight and architecture-agnostic, making it a valuable contribution to the field of vision-language models.
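The span-substitution idea can be sketched as a data transformation: pick a span of a text prompt and replace it with an image segment that carries the same text. The sketch below stubs out the actual rendering and stores the span in a placeholder record; the segment schema and function names are hypothetical, and the paper's span-selection strategy is not reproduced here.

```python
def substitute_span(prompt: str, span: str) -> list[dict]:
    """Recast one text span as an image segment, keeping the rest as text."""
    start = prompt.find(span) if span else -1
    if start == -1:
        # Nothing to substitute: the prompt stays text-only.
        return [{"type": "text", "content": prompt}]
    end = start + len(span)
    segments = []
    if prompt[:start]:
        segments.append({"type": "text", "content": prompt[:start]})
    # A real pipeline would rasterize `span` to pixels here (assumption: stubbed).
    segments.append({"type": "image", "rendered_text": span})
    if prompt[end:]:
        segments.append({"type": "text", "content": prompt[end:]})
    return segments


def carried_text(segments: list[dict]) -> str:
    """Recover the semantics carried across both modalities, in order."""
    pieces = []
    for segment in segments:
        key = "content" if segment["type"] == "text" else "rendered_text"
        pieces.append(segment[key])
    return "".join(pieces)


if __name__ == "__main__":
    prompt = "What is 12 * 7 in base 10?"
    segments = substitute_span(prompt, "12 * 7")
    print(segments)
    print(carried_text(segments) == prompt)
```

The invariant checked at the end is the point of the method: the interleaved sequence carries the same content as the original prompt, only through a different carrier.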
🔥 MobileGym: A Verifiable and Highly Parallel Simulation Platform for Mobile GUI Agent Research
📅 Published on May 25
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.26114
• PDF: https://arxiv.org/pdf/2605.26114
• Project Page: https://mobilegym.github.io
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#MobileGUIAgents #ParallelSimulation #ReinforcementLearning #MobileEnvironmentSimulation #GUIAgentResearch
💡 The paper introduces MobileGym, a browser-based mobile environment designed for mobile GUI agent research. The main problem addressed is the lack of a verifiable and highly parallel simulation platform for training and evaluating mobile GUI agents. Traditional methods are limited by their inability to provide deterministic outcome signals and scalable reinforcement learning.
The authors propose MobileGym as a solution, which enables deterministic evaluation and scalable reinforcement learning through JSON-based state management and parallel execution. The platform captures the full environment state as structured JSON, allowing for easy configuration, forking, and comparison of states. This approach enables a single server to host hundreds of parallel instances, with low memory requirements and fast startup times.
MobileGym features a layered state model and a declarative task-definition framework, making it practical to create and program tasks at scale. The platform also includes a single programmatic judging mechanism that delivers both deterministic evaluation verdicts and dense RL rewards. To facilitate research, the authors provide MobileGym-Bench, a collection of 416 parameterized task templates across 28 apps, including 256 test and 160 train templates.
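A minimal sketch of what JSON-based state management can look like: fork a state by deep copy, diff two states by path, and run a single judge that returns both a pass/fail verdict and a fractional reward. The state layout, the path syntax, and the reward definition are all assumptions for illustration and are not taken from MobileGym.

```python
import copy
import json


def fork_state(state: dict) -> dict:
    """Create an independent copy so parallel instances cannot interfere."""
    return copy.deepcopy(state)


def diff_states(before: dict, after: dict, path: str = "") -> dict:
    """Map each changed path to its (old, new) values."""
    changes = {}
    for key in sorted(set(before) | set(after)):
        here = f"{path}/{key}"
        old, new = before.get(key), after.get(key)
        if isinstance(old, dict) and isinstance(new, dict):
            changes.update(diff_states(old, new, here))
        elif old != new:
            changes[here] = (old, new)
    return changes


def lookup(state: dict, path: str):
    """Resolve a path such as '/settings/wifi'; missing keys yield None."""
    node = state
    for key in path.strip("/").split("/"):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def judge(state: dict, expected: dict) -> tuple[bool, float]:
    """One judge, two signals: a verdict and a dense reward in [0, 1]."""
    if not expected:
        raise ValueError("expected must list at least one condition")
    hits = sum(1 for path, value in expected.items() if lookup(state, path) == value)
    return hits == len(expected), hits / len(expected)


if __name__ == "__main__":
    base = json.loads('{"settings": {"wifi": false, "volume": 3}, "notes": []}')
    fork = fork_state(base)
    fork["settings"]["wifi"] = True
    print(diff_states(base, fork))
    print(judge(fork, {"/settings/wifi": True, "/settings/volume": 5}))
```

Because the whole environment is plain data, checking an outcome reduces to comparing values, which is what makes the verdict deterministic.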
In a Sim-to-Real case study, a model trained in the simulation environment achieves a 12.8 percentage point gain on a 256-task test set. When executed on real devices, the model retains 95.1% of the simulation-side training gain, which suggests that training in MobileGym transfers to real-world use.
👍1
AI & ML Papers
🔥 GenClaw: Code-Driven Agentic Image Generation
📅 Published on May 28
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.30248
• PDF: https://arxiv.org/pdf/2605.30248
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#AgenticImageGeneration #CodeDrivenArt #StagedImageConstruction #VisualConstructionTechniques #ImageGenerationFrameworks
💡 The paper introduces GenClaw, a code-driven agentic image generation framework that enables precise visual construction through a staged process. The problem with existing image generation models is that they are black-box systems that rely on text-conditioned pixel synthesis, leaving them with no direct mechanism to manipulate the canvas. This leads to a repetitive cycle of prompt rewriting for generation refinement, limiting their potential for precise visual construction.
The GenClaw method addresses this issue by empowering the agent to create like a human artist, through three stages: conceptualization, sketching, and coloring. In the conceptualization stage, the agent constructs conceptual knowledge and context through search and reasoning. The agent then utilizes code, such as SVG or HTML, to render executable visual sketches in the sketching stage. Finally, it employs an image generation model to supplement textures, materials, and photorealism in the coloring stage.
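To show what "code as a sketch" can mean in the sketching stage, here is a small generator that turns a list of shape descriptions into an SVG document. The shape schema and function names are hypothetical and are not GenClaw's interface; the point is only that layout becomes explicit, editable data before any pixels are synthesized.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

RECT = '<rect x="{x}" y="{y}" width="{w}" height="{h}" fill="none" stroke="black"/>'
TEXT = '<text x="{x}" y="{y}">{label}</text>'


def render_sketch(width: int, height: int, shapes: list[dict]) -> str:
    """Emit an SVG outline sketch from simple shape descriptions."""
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for shape in shapes:
        if shape["kind"] == "rect":
            parts.append(RECT.format(**shape))
        elif shape["kind"] == "text":
            # Escape the label so characters like '&' keep the XML well-formed.
            parts.append(TEXT.format(x=shape["x"], y=shape["y"],
                                     label=escape(shape["label"])))
        else:
            raise ValueError(f"unsupported shape kind: {shape['kind']}")
    parts.append("</svg>")
    return "\n".join(parts)


def count_elements(svg: str) -> int:
    """Parse the SVG (raises on malformed XML) and count top-level shapes."""
    return len(ET.fromstring(svg))


if __name__ == "__main__":
    svg = render_sketch(200, 100, [
        {"kind": "rect", "x": 10, "y": 10, "w": 80, "h": 60},
        {"kind": "text", "x": 20, "y": 90, "label": "Tom & Jerry"},
    ])
    print(svg)
    print(count_elements(svg))
```

Moving the rectangle is a one-number edit to the shape list, in contrast to rewriting a prompt and hoping the layout changes as intended.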
By using code as a controllable intermediate canvas, GenClaw bridges linguistic reasoning and pixel synthesis, seamlessly integrating programmatic logic with the visual expressiveness of generative models. This approach transforms image generation from a black-box paradigm into a staged process, offering a step toward highly controllable and interpretable visual generation systems. The results of GenClaw demonstrate a more precise and interpretable image generation process, allowing for direct manipulation of the canvas and overcoming the limitations of existing black-box image models.
❤5
AI & ML Papers
🔥 VibeSearchBench: Benchmarking Long-horizon Proactive Search in the Wild
📅 Published on May 27
🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.27882
• PDF: https://arxiv.org/pdf/2605.27882
• Project Page: https://vibebench.github.io/VibeSearchBench.github.io/
━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus
#ProactiveSearch #LongHorizonSearch #MultiTurnDialogue #CollaborativeSearch #NaturalLanguageSearch
💡 The paper introduces VibeSearchBench, a benchmark for evaluating long-horizon proactive search in real-world scenarios. The problem addressed is the poor performance of large language model-based agents in search tasks that involve multi-turn dialogue and collaborative refinement of user intent. Existing benchmarks rely on over-specified queries, single-turn interactions, and fixed-schema evaluation, which do not reflect real search behavior.
To address this issue, the authors propose VibeSearch, a paradigm that involves multi-turn dialogue and collaborative refinement of vague user intent. The VibeSearchBench benchmark consists of 200 manually curated bilingual tasks across 20 domains, split into professional and daily-life subsets. Each task pairs a user persona with a schema-free ground-truth knowledge graph and is evaluated through a progressive-disclosure user simulator and a graph-matching evaluation framework.
The authors benchmark seven frontier models under two different frameworks and find that all models perform poorly, with the best F1 score being 30.30. This highlights the need for fundamental advances in long-context reasoning, proactive intent elicitation, and structured knowledge construction. The paper's contributions include the introduction of the VibeSearch paradigm, the creation of the VibeSearchBench benchmark, and the evaluation of state-of-the-art models in this new benchmark, which reveals the significant gap between current models and real-world search requirements.
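For intuition about how an F1 score over knowledge graphs can be computed, the sketch below scores a predicted graph against a ground-truth graph, both represented as sets of (subject, relation, object) triples. Exact-match comparison is an assumption made here for simplicity; the benchmark's graph-matching framework is schema-free and is likely more permissive than this.

```python
Triple = tuple[str, str, str]


def graph_f1(predicted: set[Triple], gold: set[Triple]) -> tuple[float, float, float]:
    """Return (precision, recall, F1) under exact triple matching."""
    if not predicted or not gold:
        return 0.0, 0.0, 0.0
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0, 0.0, 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    gold = {("cafe_a", "located_in", "berlin"), ("cafe_a", "serves", "espresso")}
    predicted = {("cafe_a", "located_in", "berlin"), ("cafe_a", "serves", "tea")}
    print(graph_f1(predicted, gold))
```

A low F1 can come from either side: missing facts the user needed (recall) or adding facts that are not in the ground truth (precision).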
❤1