AI & ML Papers

MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven...

With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial...

🔥 HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

💡 The paper introduces HyperEyes, a parallel multimodal search agent designed to optimize inference efficiency through dual-grained reinforcement learning. Existing multimodal search agents process target entities sequentially, which can lead to redundant interaction rounds and decreased efficiency. HyperEyes addresses this issue by fusing visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities.
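
To make the sequential-versus-parallel contrast concrete, here is a minimal sketch, not the HyperEyes implementation. The `ground_and_retrieve` coroutine is a hypothetical stand-in for the fused grounding-plus-retrieval action, and a "round" is counted as one batch of tool calls issued together.

```python
import asyncio


async def ground_and_retrieve(entity: str) -> str:
    # Hypothetical stand-in for one fused "ground + retrieve" tool call.
    await asyncio.sleep(0.01)
    return f"result for {entity}"


async def sequential_search(entities):
    # One tool call per entity, one interaction round each.
    results, rounds = [], 0
    for entity in entities:
        results.append(await ground_and_retrieve(entity))
        rounds += 1
    return results, rounds


async def parallel_search(entities):
    # All independent entities are searched concurrently in a single round.
    results = await asyncio.gather(*(ground_and_retrieve(e) for e in entities))
    return list(results), 1


entities = ["logo", "building", "person"]
seq_results, seq_rounds = asyncio.run(sequential_search(entities))
par_results, par_rounds = asyncio.run(parallel_search(entities))
print(seq_rounds, par_rounds)
```

Both paths return the same results here; only the number of interaction rounds differs, which is the quantity the summary says HyperEyes targets.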

The HyperEyes system is trained in two stages. The first stage involves a Parallel-Amenable Data Synthesis Pipeline that generates efficiency-oriented trajectories via Progressive Rejection Sampling. The second stage utilizes a Dual-Grained Efficiency-Aware Reinforcement Learning framework, which operates at two levels. At the macro level, the framework uses a trajectory-level reward called TRACE, which is designed to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, the framework adapts On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards.

The paper also introduces IMEB, a new benchmark that jointly evaluates search capability and efficiency. HyperEyes surpasses the strongest comparable open-source agent by 9.9% in accuracy while using 5.3x fewer tool-call rounds on average. These results suggest that dual-grained reinforcement learning can cut redundant tool calls without sacrificing search accuracy.

📅 Published on May 8

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.07177
• PDF: https://arxiv.org/pdf/2605.07177
• GitHub: https://github.com/DeepExperience/HyperEyes ⭐ 34

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalSearchAgents #ReinforcementLearning #ParallelSearchAlgorithms #EfficiencyAwareSystems #DualGrainedLearning

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning...

Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent...

🔥 What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

💡 This paper investigates the properties of a latent manifold that are favorable for diffusion models, which are a type of generative model. The authors argue that existing methods for defining the latent space, known as tokenizers, are primarily designed to improve reconstruction fidelity or inherit pre-trained representations, but do not necessarily produce a latent space that is well-suited for generative modeling. To address this issue, the authors study the properties of a diffusion-friendly latent manifold and identify three key properties: coherent spatial structure, local manifold continuity, and global manifold semantics. They find that these properties are more closely related to downstream generation quality than reconstruction fidelity.

To explicitly shape the latent manifold with these desirable properties, the authors propose a new method called the Prior-Aligned AutoEncoder, or PAE. The PAE uses refined priors derived from variational autoencoders and perturbation-based regularization to turn the desired properties of the latent manifold into explicit training objectives. This approach allows the PAE to directly optimize the latent space structure for improved generative modeling.
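
As a rough illustration of what a perturbation-based regularizer for local manifold continuity could look like, the sketch below penalizes how much a decoder's output moves when the latent is jittered with small Gaussian noise. This is an assumption for illustration, not PAE's actual objective: the toy linear `decode` function, `sigma`, and `n_samples` are all hypothetical.

```python
import random


def decode(z, weights):
    # Toy linear decoder: each output is the dot product of a weight row with z.
    return [sum(w * zi for w, zi in zip(row, z)) for row in weights]


def perturbation_penalty(z, weights, sigma=0.1, n_samples=8, seed=0):
    # Mean squared change in the decoded output under small latent noise.
    rng = random.Random(seed)
    base = decode(z, weights)
    total = 0.0
    for _ in range(n_samples):
        z_noisy = [zi + rng.gauss(0.0, sigma) for zi in z]
        out = decode(z_noisy, weights)
        total += sum((a - b) ** 2 for a, b in zip(out, base)) / len(base)
    return total / n_samples


z = [0.5, -1.0]
smooth_decoder = [[0.1, 0.2], [0.3, -0.1]]
sharp_decoder = [[1.0, 2.0], [3.0, -1.0]]
print(perturbation_penalty(z, smooth_decoder))
print(perturbation_penalty(z, sharp_decoder))
```

A decoder that reacts strongly to small latent changes receives a larger penalty, so minimizing a term like this would push nearby latents toward similar outputs.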

The authors evaluate the PAE on the ImageNet 256x256 dataset and find that it improves both training efficiency and generation quality compared to existing tokenizers. Specifically, the PAE achieves comparable performance to the state-of-the-art method, RAE, but with up to 13 times faster convergence under the same training setup. Additionally, the PAE achieves a new state-of-the-art result, with a generative fidelity score of 1.03. These results highlight the importance of organizing the latent manifold for latent diffusion models and demonstrate the effectiveness of the PAE in producing high-quality generative models.

📅 Published on May 8

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.07915
• PDF: https://arxiv.org/pdf/2605.07915
• Project Page: https://zhengrongyue.github.io/pae.github.io/
• GitHub: https://github.com/ZhengrongYue/PAE ⭐ 29

#LatentDiffusionModels #GenerativeModeling #AutoencoderArchitecture #LatentManifoldLearning #DiffusionBasedGenerativeModels

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned...

Tokenizers are a crucial component of latent diffusion models, as they define the latent space in which diffusion models operate. However, existing tokenizers are primarily designed to improve...

🔥 LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

💡 The paper proposes a novel approach to improve the performance of large language models through test-time scaling, which involves allocating additional computation during inference. Existing test-time scaling strategies are typically hand-crafted, relying on manual design and tuning of reasoning patterns and heuristics. This approach leaves much of the computation-allocation space unexplored, resulting in potential inefficiencies.

To address this limitation, the authors introduce AutoTTS, an environment-driven framework that automates the discovery of test-time scaling strategies. Instead of designing individual strategies, researchers can create environments where optimal strategies can be discovered automatically. The key to AutoTTS lies in constructing a discovery environment that provides a tractable control space and frequent, low-cost feedback for strategy search.

The authors formulate test-time scaling as a controller synthesis problem over pre-collected reasoning trajectories and probe signals. In this framework, controllers decide when to branch, continue, probe, prune, or stop, and can be evaluated cheaply without requiring repeated calls to the language model. To make the search tractable, the authors introduce beta parameterization, which enables fine-grained execution trace feedback to improve discovery efficiency.
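
The idea of evaluating controllers offline on pre-collected trajectories can be sketched as follows. This is a simplified illustration under stated assumptions, not AutoTTS itself: the `CACHE` data, the `early_stop_controller` rule (stop once one answer has `min_votes` votes), and the token costs are all hypothetical.

```python
from collections import Counter

# Hypothetical cache: per question, a gold answer and pre-collected
# trajectories given as (final_answer, token_cost) pairs.
CACHE = [
    {"gold": "4", "trajs": [("4", 100), ("4", 120), ("5", 90), ("4", 110)]},
    {"gold": "7", "trajs": [("3", 80), ("7", 100), ("7", 95), ("7", 130)]},
]


def early_stop_controller(trajs, min_votes=2, max_branches=4):
    # Consume cached trajectories one by one; stop once an answer has enough votes.
    votes = Counter()
    cost = 0
    for answer, tokens in trajs[:max_branches]:
        votes[answer] += 1
        cost += tokens
        if votes[answer] >= min_votes:
            return answer, cost
    return votes.most_common(1)[0][0], cost


def evaluate(controller, cache, **params):
    # Replay the controller on cached data: no language-model calls are needed.
    correct = 0
    total_cost = 0
    for item in cache:
        answer, cost = controller(item["trajs"], **params)
        correct += int(answer == item["gold"])
        total_cost += cost
    return correct / len(cache), total_cost


print(evaluate(early_stop_controller, CACHE, min_votes=1))
print(evaluate(early_stop_controller, CACHE, min_votes=2))
```

Because every candidate controller is scored by replaying the same cache, a search procedure can compare many accuracy-cost tradeoffs cheaply, which is the property the summary highlights.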

The proposed approach is evaluated on mathematical reasoning benchmarks, where the discovered strategies show better accuracy-cost tradeoffs than strong manually designed baselines. The discovered strategies also generalize to held-out benchmarks and model scales. Notably, the entire discovery process costs only $39.9 and 160 minutes.

Overall, the paper contributes a novel framework for automating test-time scaling strategy discovery, which has the potential to improve the performance of large language models while reducing the need for manual design and tuning. The authors also make their data and code available, facilitating further research and development in this area.

📅 Published on May 8

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.08083
• PDF: https://arxiv.org/pdf/2605.08083
• Project Page: https://zhengkid.github.io/AutoTTS-web/
• GitHub: https://github.com/zhengkid/AutoTTS ⭐ 43

#LargeLanguageModels #TestTimeScaling #AgenticDiscovery #AutomatedReasoning #LanguageModelOptimization

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are...

🔥 UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

💡 The paper introduces UniPrefill, a universal prefill acceleration framework designed to improve the inference efficiency of long-context processing in large language models. The problem addressed is that existing prefill acceleration methods are limited to specific model architectures and suffer performance degradation when applied to emerging architectures. Additionally, these methods are often incompatible with continuous batching, making it difficult to integrate them into modern inference engines.

The proposed UniPrefill framework overcomes these limitations by directly accelerating the model's computation at the token level, making it applicable to virtually any model architecture. UniPrefill is implemented as a continuous batching operator and is integrated into the vLLM inference engine, enabling seamless support for prefill-decode co-processing and tensor parallelism.
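
The summary does not spell out how the block-wise sparsification works, so the following is only a generic sketch of the idea: group tokens into fixed-size blocks, score each block, and keep the highest-scoring ones. The importance scores, `block_size`, and `keep_ratio` are hypothetical, and UniPrefill's actual selection rule may differ.

```python
def select_blocks(token_scores, block_size=4, keep_ratio=0.5):
    # Split token positions into contiguous blocks and score each block by its mean.
    blocks = [token_scores[i:i + block_size]
              for i in range(0, len(token_scores), block_size)]
    block_scores = [sum(block) / len(block) for block in blocks]

    # Keep the top-scoring fraction of blocks (at least one).
    n_keep = max(1, round(len(blocks) * keep_ratio))
    ranked = sorted(range(len(blocks)), key=lambda i: block_scores[i], reverse=True)
    kept_blocks = sorted(ranked[:n_keep])

    # Return the token indices that survive, in their original order.
    kept_tokens = []
    for b in kept_blocks:
        start = b * block_size
        kept_tokens.extend(range(start, min(start + block_size, len(token_scores))))
    return kept_tokens


scores = [0.9, 0.8, 0.1, 0.1,
          0.0, 0.1, 0.0, 0.1,
          0.7, 0.6, 0.9, 0.8,
          0.2, 0.1, 0.0, 0.1]
print(select_blocks(scores))
```

Selecting whole blocks rather than scattered tokens keeps the surviving work contiguous, which tends to map better onto batched GPU kernels.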

The results show that UniPrefill achieves significant speedup, with up to 2.1x improvement in Time-To-First-Token, and the acceleration becomes more pronounced as the number of concurrent requests grows. This makes UniPrefill a valuable contribution to the field, enabling more efficient and scalable long-context processing in large language models.

📅 Published on May 7

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.06221
• PDF: https://arxiv.org/pdf/2605.06221
• GitHub: https://github.com/qhfan/UniPrefill ⭐ 22

#LongContextProcessing #PrefillAcceleration #DynamicSparsification #LargeLanguageModels #BlockWiseOptimization

UniPrefill: Universal Long-Context Prefill Acceleration via...

As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. To improve the inference efficiency...

🔥 Pixal3D: Pixel-Aligned 3D Generation from Images

💡 The paper introduces Pixal3D, a new approach to generating 3D models from images that addresses the issue of fidelity, which refers to how accurately the generated 3D model represents the input image. Current 3D generative models often struggle with this due to the implicit correspondence between 2D images and 3D models. Pixal3D solves this problem by generating 3D models in a pixel-aligned way, meaning that each pixel in the input image is directly associated with a corresponding point in the 3D model.

To achieve this, the authors propose a pixel back-projection conditioning scheme that lifts image features into a 3D feature volume, establishing a direct correspondence between pixels and 3D points. This approach allows for high-fidelity 3D asset creation from images and can be scaled up to produce high-quality models. The method also extends to multi-view generation, where feature volumes from multiple views are aggregated to produce a more accurate 3D model.
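
A minimal sketch of lifting image features into a 3D volume, assuming a standard pinhole camera: each voxel center is projected into the image and takes the feature of the nearest pixel. The intrinsics (`fx`, `fy`, `cx`, `cy`), the nearest-pixel lookup, and the toy feature map are assumptions for illustration, not Pixal3D's actual conditioning scheme.

```python
def project(point, fx, fy, cx, cy):
    # Pinhole projection of a camera-space 3D point to pixel coordinates (u, v).
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy


def lift_features(feature_map, voxel_centers, fx, fy, cx, cy):
    # For each voxel center, look up the feature of the pixel it projects onto.
    height, width = len(feature_map), len(feature_map[0])
    volume = []
    for center in voxel_centers:
        if center[2] <= 0:
            volume.append(None)  # behind the camera: no feature
            continue
        u, v = project(center, fx, fy, cx, cy)
        col, row = int(round(u)), int(round(v))
        if 0 <= row < height and 0 <= col < width:
            volume.append(feature_map[row][col])
        else:
            volume.append(None)  # projects outside the image
    return volume


feature_map = [["a", "b"],
               ["c", "d"]]
voxels = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (2.0, 2.0, 2.0),
          (5.0, 0.0, 1.0), (0.0, 0.0, -1.0)]
print(lift_features(feature_map, voxels, fx=1.0, fy=1.0, cx=0.0, cy=0.0))
```

All voxels along the same camera ray receive the same pixel feature, which is what gives every 3D location an explicit link back to a pixel.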

The results show that Pixal3D substantially improves fidelity and approaches the level of reconstruction-based methods. Additionally, the authors demonstrate that pixel-aligned generation can benefit scene synthesis and propose a modular pipeline for producing high-fidelity, object-separated 3D scenes from images. Overall, Pixal3D provides a new approach to 3D generation that can produce high-fidelity models from single or multi-view images, and has the potential to inspire further research in this area.

📅 Published on May 11

🔗 Links:
• Project Page: https://huggingface.co/papers?q=back-projection%20conditioning
• arXiv: https://arxiv.org/abs/2605.10922
• PDF: https://arxiv.org/pdf/2605.10922
• GitHub: https://github.com/TencentARC/Pixal3D ⭐ 197

🤖 Models citing this paper:
• https://huggingface.co/TencentARC/Pixal3D

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/TencentARC/Pixal3D

#3DModelGeneration #PixelAlignedRendering #ImageTo3D #3DGenerativeModels #DeepLearningForComputerVision

🔥 NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

💡 The paper introduces NanoResearch, a multi-agent framework designed to enhance research automation through personalized assistance. The problem addressed is that current research automation systems produce uniform outputs, which can under-serve individual users due to differences in resource configurations, methodological preferences, and target output formats. To achieve personalization, three capabilities are required: accumulating reusable procedural knowledge, retaining user-specific experience, and internalizing implicit preferences.

The proposed method, NanoResearch, addresses these gaps through a tri-level co-evolution approach. It consists of three components: a skill bank that distills recurring operations into reusable procedural rules, a memory module that maintains user- and project-specific experience, and a label-free policy learning module that converts free-form feedback into persistent parameter updates. These components co-evolve, with reliable skills producing richer memory, richer memory informing better planning, and preference internalization continuously realigning the loop to each user.

The results of extensive experiments demonstrate that NanoResearch delivers substantial gains over state-of-the-art AI research systems. It progressively refines itself to produce better research at lower cost over successive cycles, making it a more effective and efficient solution for research automation. Overall, the paper contributes a novel framework for personalized research automation, addressing the limitations of current systems and providing a more tailored approach to research assistance.

📅 Published on May 11

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.10813
• PDF: https://arxiv.org/pdf/2605.10813
• GitHub: https://github.com/OpenRaiser/NanoResearch ⭐ 940

#ResearchAutomation #PersonalizedAssistance #MultiAgentFramework #ProceduralKnowledge #AutomatedResearchSystems

NanoResearch: Co-Evolving Skills, Memory, and Policy for...

LLM-powered multi-agent systems can now automate the full research pipeline from ideation to paper writing, but a fundamental question remains: automation for whom? Researchers operate under...

🔥 MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

💡 The paper introduces MiniCPM-V 4.5, a highly efficient 8 billion parameter multimodal large language model that achieves strong performance. The development of multimodal large language models is rapidly advancing, but their training and inference efficiency has become a major obstacle to making them more accessible and scalable. To address this challenge, the authors propose three key improvements: a unified 3D-Resampler architecture for compact encoding of images and videos, a unified learning paradigm for document knowledge and text recognition without requiring extensive data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes.

The unified 3D-Resampler architecture enables highly compact encoding of visual data, while the unified learning paradigm simplifies the learning process by eliminating the need for heavy data engineering. The hybrid reinforcement learning strategy allows the model to excel in both short and long reasoning modes, making it a versatile and efficient model.

In the OpenCompass evaluation, MiniCPM-V 4.5 outperforms widely used proprietary models such as GPT-4o-latest and significantly larger open-source models such as Qwen2.5-VL 72B. It also achieves state-of-the-art performance on the VideoMME benchmark among models under 30 billion parameters, using only 46.7% of the GPU memory and 8.7% of the inference time of Qwen2.5-VL 7B.
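
The two efficiency percentages are easier to compare when converted into ratios. The short calculation below uses only the figures quoted in the summary; the variable names are ours.

```python
# Figures quoted in the summary, relative to Qwen2.5-VL 7B.
memory_fraction = 0.467  # 46.7% of the GPU memory cost
time_fraction = 0.087    # 8.7% of the inference time

memory_reduction = 1 / memory_fraction
speedup = 1 / time_fraction

print(f"about {memory_reduction:.1f}x less GPU memory")
print(f"about {speedup:.1f}x faster inference")
```

In other words, the quoted numbers correspond to roughly a 2.1x reduction in GPU memory and an 11.5x reduction in inference time.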

MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and...

Multimodal Large Language Models (MLLMs) are undergoing rapid progress and represent the frontier of AI development. However, their training and inference efficiency have emerged as a core...

🔥 RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

💡 The paper introduces RoboMemArena, a comprehensive robotic memory benchmark that addresses the limitations of existing benchmarks by providing a large-scale and diverse set of tasks with real-world evaluation. The benchmark consists of 26 tasks with average trajectory lengths of over 1000 steps per task, and 68.9 percent of subtasks require memory dependence. The tasks are generated using a vision-language model that designs and composes subtasks, generates full trajectories, and provides memory-related annotations.

To tackle the challenges of the RoboMemArena benchmark, the authors propose PrediMem, a dual-system vision-language architecture that improves memory management through predictive coding. PrediMem consists of a high-level vision-language model planner that manages a memory bank with recent and keyframe buffers, and uses a predictive coding head to enhance sensitivity to task dynamics.
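
A memory bank with separate recent and keyframe buffers can be sketched with two small containers. This is an illustrative data structure only: the buffer sizes and the `is_keyframe` flag are hypothetical, and how PrediMem actually selects keyframes is not described in the summary.

```python
from collections import deque


class MemoryBank:
    def __init__(self, recent_size=3, max_keyframes=4):
        # Recent buffer: sliding window over the latest observations.
        self.recent = deque(maxlen=recent_size)
        # Keyframe buffer: sparse long-term memory of important steps.
        self.keyframes = []
        self.max_keyframes = max_keyframes

    def add(self, step, observation, is_keyframe=False):
        self.recent.append((step, observation))
        if is_keyframe:
            self.keyframes.append((step, observation))
            if len(self.keyframes) > self.max_keyframes:
                self.keyframes.pop(0)  # drop the oldest keyframe

    def context(self):
        # Merge both buffers, remove duplicate steps, and return chronologically.
        merged = {step: obs for step, obs in self.keyframes}
        merged.update({step: obs for step, obs in self.recent})
        return [merged[step] for step in sorted(merged)]


bank = MemoryBank(recent_size=2)
for step in range(6):
    bank.add(step, f"obs{step}", is_keyframe=(step == 1))
print(bank.context())
```

The planner would see the early keyframe alongside the latest observations, even though the steps in between have been forgotten.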

The authors evaluate PrediMem on the RoboMemArena benchmark and demonstrate that it outperforms all baseline models. The results provide insights into memory management, model architecture, and scaling laws for complex memory systems. The paper contributes to the development of robotic intelligence by providing a comprehensive benchmark and a state-of-the-art model that can effectively manage memory in partially observable environments.

The key contributions of the paper are the introduction of the RoboMemArena benchmark, which provides a challenging and diverse set of tasks for evaluating robotic memory, and the proposal of the PrediMem model, which demonstrates improved memory management through predictive coding. The paper also provides a thorough evaluation of the PrediMem model on the RoboMemArena benchmark, highlighting its effectiveness in managing memory in complex tasks. Overall, the paper advances the state-of-the-art in robotic memory and provides a foundation for future research in this area.

📅 Published on May 11

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.10921
• PDF: https://arxiv.org/pdf/2605.10921
• Project Page: https://robomemarena.github.io/
• GitHub: https://github.com/OpenHelix-Team/RoboMemArena ⭐ 43

#RoboticMemoryBenchmark #VisionLanguageModel #RoboticsAndMemory #ArtificialIntelligenceBenchmarking #RoboMemArena

RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

Memory is a critical component of robotic intelligence, as robots must rely on past observations and actions to accomplish long-horizon tasks in partially observable environments. However,...

🔥 CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

💡 This paper proposes a novel approach called CapVector to improve the performance of vision-language-action models. The problem addressed is that pre-trained models often fail to improve performance and reduce adaptation costs during standard supervised finetuning. Advanced finetuning methods with auxiliary training objectives can improve performance but incur significant computational overhead.

The proposed method decouples the auxiliary training objectives from standard supervised finetuning to enhance model capabilities while reducing computational overhead. It first trains the model to convergence on a small-scale task set with two distinct training strategies, yielding two finetuned models. The parameter difference between the two models is interpreted as the capability vectors contributed by the auxiliary objectives. These vectors are then merged with the pre-trained parameters to form a capability-enhanced meta model.
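
The parameter arithmetic described above can be sketched with plain dictionaries standing in for model weights. The direction of the subtraction (auxiliary-finetuned minus plain finetuned), the `scale` factor, and the toy parameter values are assumptions for illustration rather than details confirmed by the summary.

```python
def capability_vector(theta_aux, theta_sft):
    # Difference between the model finetuned with auxiliary objectives
    # and the model finetuned with standard supervised finetuning only.
    return {name: theta_aux[name] - theta_sft[name] for name in theta_sft}


def merge(theta_pre, vector, scale=1.0):
    # Add the (optionally scaled) capability vector to the pre-trained weights.
    return {name: theta_pre[name] + scale * vector[name] for name in theta_pre}


theta_sft = {"w": 1.0, "b": 0.5}   # finetuned without auxiliary objectives
theta_aux = {"w": 1.5, "b": 0.25}  # finetuned with auxiliary objectives
theta_pre = {"w": 2.0, "b": 0.0}   # pre-trained weights

vector = capability_vector(theta_aux, theta_sft)
meta_model = merge(theta_pre, vector)
print(vector, meta_model)
```

With real models the same operations would be applied tensor by tensor over the parameter dictionaries, and the merged weights would then be finetuned on the downstream task.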

The method also uses a lightweight orthogonal regularization loss to augment standard supervised finetuning, which reduces computational overhead. The results show that the capability vectors are effective and versatile across diverse models, and can generalize to novel environments and embodiments without additional training. The proposed approach achieves performance comparable to auxiliary finetuned baselines with reduced computational overhead, making it a promising solution for improving vision-language-action models.

📅 Published on May 11

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.10903
• PDF: https://arxiv.org/pdf/2605.10903
• Project Page: https://capvector.github.io/
• GitHub: https://github.com/OpenHelix-Team/CapVector ⭐ 26

🤖 Models citing this paper:
• https://huggingface.co/haofuly/capvector_models_collection

#VisionLanguageModels #ParametricSpaceLearning #TransferableCapabilities #VisionLanguageAction #MultimodalLearning

CapVector: Learning Transferable Capability Vectors in Parametric...

This paper proposes a novel approach to address the challenge that pretrained VLA models often fail to effectively improve performance and reduce adaptation costs during standard supervised...

🔥 Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

💡 This paper introduces rStar, an approach that improves the reasoning capabilities of small language models without fine-tuning or help from larger models. Small language models often struggle with complex reasoning tasks, which limits the problems they can solve.

rStar uses a self-play mutual generation-discrimination process. One small language model generates reasoning trajectories using Monte Carlo Tree Search augmented with human-like reasoning actions, and a second model of similar capability acts as a discriminator that verifies these trajectories. Trajectories that both models agree on are considered more likely to be correct.

The results show that rStar can solve diverse reasoning problems, including math and strategy-based tasks, and significantly improves the accuracy of small language models. For example, rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B and from 36.46% to 81.88% for Mistral-7B.
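
The mutual-agreement filter can be sketched as follows: the discriminator receives the first part of each generated trajectory, completes it, and the trajectory is kept only if both models reach the same final answer. This is a simplified illustration; the `split` ratio, the lookup-table `toy_discriminator`, and the toy trajectories are hypothetical stand-ins for real model calls.

```python
def mutually_consistent(trajectories, discriminator, split=0.5):
    # Keep trajectories whose final answer the discriminator reproduces
    # when given only a prefix of the reasoning steps.
    agreed = []
    for steps, answer in trajectories:
        cut = max(1, int(len(steps) * split))
        prefix = steps[:cut]
        if discriminator(prefix) == answer:
            agreed.append((steps, answer))
    return agreed


# Toy generator output for "2 + 3 + 4": one sound and one faulty trajectory.
trajectories = [
    (["2+3=5", "5+4=9"], "9"),
    (["2+3=6", "6+4=10"], "10"),
]

# Toy discriminator: a lookup table standing in for a second small model
# that completes the reasoning from the given prefix.
COMPLETIONS = {("2+3=5",): "9", ("2+3=6",): "9"}


def toy_discriminator(prefix):
    return COMPLETIONS[tuple(prefix)]


agreed = mutually_consistent(trajectories, toy_discriminator)
print(agreed)
```

Only the first trajectory survives, because the second model does not reproduce the faulty answer when asked to finish the reasoning itself.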

📅 Published on Aug 12, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2408.06195
• PDF: https://arxiv.org/pdf/2408.06195
• GitHub: https://github.com/codelion/optillm ⭐ 3.7k

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/algorithmicsuperintelligence/OptiLLM
• https://huggingface.co/spaces/fabiodr/optillm
• https://huggingface.co/spaces/EduuGomes/CachoeiraBot

#MutualReasoning #LLMProblemSolving #MonteCarloTreeSearch #SelfPlayLearning #LanguageModelOptimization

Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

This paper introduces rStar, a self-play mutual reasoning approach that significantly improves reasoning capabilities of small language models (SLMs) without fine-tuning or superior models. rStar...
