AI & ML Papers

3D Gaussian Splatting for Real-Time Radiance Field Rendering

Radiance Field methods have recently revolutionized novel-view synthesis of scenes captured with multiple photos or videos. However, achieving high visual quality still requires neural networks...

🔥 Fish Audio S2 Technical Report

💡 The paper introduces Fish Audio S2, an open-source text-to-speech system with multi-speaker and multi-turn generation and instruction-following control through natural-language descriptions. Training is multi-stage and relies on a staged data pipeline covering video captioning, speech captioning, voice quality assessment, and reward modeling, which the authors use to scale training. The authors release the model weights, fine-tuning code, and an inference engine they describe as production-ready for streaming. The engine achieves a real-time factor of 0.195 and a time-to-first-audio below 100 milliseconds. Code and weights are available on GitHub and Hugging Face, and custom voices can be tried on the project website.
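
To put the speed figure in context: real-time factor is conventionally the time spent synthesizing divided by the duration of the audio produced, so values below 1 are faster than real time. The sketch below applies that conventional definition to the reported 0.195. The function names are illustrative assumptions and are not part of the Fish Audio codebase.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """Conventional RTF: time spent generating divided by audio duration."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time_for(audio_seconds: float, rtf: float) -> float:
    """Seconds of compute needed to produce `audio_seconds` of speech at a given RTF."""
    return audio_seconds * rtf


# At an RTF of 0.195, ten seconds of speech needs roughly two seconds of compute.
print(round(synthesis_time_for(10.0, 0.195), 3))
```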

📅 Published on Mar 9

🔗 Links:
• arXiv: https://arxiv.org/abs/2603.08823
• PDF: https://arxiv.org/pdf/2603.08823
• Project Page: https://fish.audio/
• GitHub: https://github.com/fishaudio/fish-speech ⭐ 30.2k

🤖 Models citing this paper:
• https://huggingface.co/fishaudio/s2-pro
• https://huggingface.co/drbaph/s2-pro-fp8
• https://huggingface.co/mlx-community/fish-audio-s2-pro-bf16

📊 Datasets citing this paper:
• https://huggingface.co/datasets/Izzyzlin/CFSDD

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/artificialguybr/fish-s2-pro-zero
• https://huggingface.co/spaces/fguilleme/fish-s2-pro-zero
• https://huggingface.co/spaces/MAYA-AI/fish-s2-pro-zero

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#TextToSpeechSystems #MultispeakerSynthesis #NaturalLanguageProcessing #SpeechGenerationModels #RealTimeAudioProcessing

Fish Audio S2 Technical Report

We introduce Fish Audio S2, an open-sourced text-to-speech system featuring multi-speaker, multi-turn generation, and, most importantly, instruction-following control via natural-language...

🔥 Flow-OPD: On-Policy Distillation for Flow Matching Models

💡 The paper addresses limitations in existing Flow Matching text-to-image models, which suffer from two main issues: reward sparsity and gradient interference. These problems lead to poor generation quality and alignment metrics. To overcome these challenges, the authors propose Flow-OPD, a two-stage alignment approach that combines on-policy distillation and manifold anchor regularization.

In the first stage, the authors fine-tune domain-specialized teacher models using single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling. Then, they establish a robust initial policy through a Flow-based Cold-Start scheme and consolidate heterogeneous expertise into a single student model.

The authors also introduce Manifold Anchor Regularization, which leverages a task-agnostic teacher to provide full-data supervision and anchors generation to a high-quality manifold. This helps mitigate aesthetic degradation commonly observed in purely RL-driven alignment.
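
The general idea of on-policy distillation with an anchor term can be sketched on a toy problem. This is not the paper's objective: the linear velocity fields, the single teacher (the paper consolidates several), the squared-error loss, and the `anchor_weight` value are all assumptions chosen for illustration. The point it shows is that the student is supervised on states it visits itself, rather than on states drawn from a fixed dataset.

```python
import numpy as np


def rollout(velocity, x0, steps=8):
    """Euler-integrate dx/dt = velocity(x, t) from t=0 to t=1; return visited (x, t) pairs."""
    dt = 1.0 / steps
    x, visited = x0.copy(), []
    for i in range(steps):
        t = i * dt
        visited.append((x.copy(), t))
        x = x + dt * velocity(x, t)
    return visited


def opd_loss(student, teacher, anchor, x0, anchor_weight=0.1):
    """Velocity-matching loss on the student's own trajectory, plus an anchor term."""
    states = rollout(student, x0)  # on-policy: states come from the student
    distill = np.mean([np.mean((student(x, t) - teacher(x, t)) ** 2) for x, t in states])
    reg = np.mean([np.mean((student(x, t) - anchor(x, t)) ** 2) for x, t in states])
    return distill + anchor_weight * reg


# Toy linear velocity fields (hypothetical stand-ins for real networks).
student = lambda x, t: -0.5 * x
teacher = lambda x, t: -1.0 * x
anchor = lambda x, t: -0.8 * x
x0 = np.random.default_rng(0).normal(size=(4, 2))
print(opd_loss(student, teacher, anchor, x0) > 0.0)
```

In a real setup the loss would be minimized with respect to the student's parameters; here it is only evaluated.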

The results show that Flow-OPD raises the GenEval score from 63 to 92 and OCR accuracy from 59 to 94. The authors report an overall improvement of roughly 10 points over vanilla GRPO while preserving image fidelity and human-preference alignment. They also observe an emergent teacher-surpassing effect, in which the student model outperforms its teachers. Overall, Flow-OPD is presented as a scalable alignment paradigm for building generalist text-to-image models.

📅 Published on May 8

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.08063
• PDF: https://arxiv.org/pdf/2605.08063
• Project Page: https://costaliya.github.io/Flow-OPD/
• GitHub: https://github.com/CostaliyA/Flow-OPD ⭐ 79

🤖 Models citing this paper:
• https://huggingface.co/CostaliyA/Flow-OPD

━━━━━━━━━━━━━━━━━━━━━━━━

#FlowMatchingModels #OnPolicyDistillation #TextToImageSynthesis #ManifoldAnchorRegularization #FlowOPD

Flow-OPD: On-Policy Distillation for Flow Matching Models

Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient...

🔥 HumanNet: Scaling Human-centric Video Learning to One Million Hours

💡 The paper introduces HumanNet, a large-scale human-centric video dataset that captures how humans interact with the physical world, with the goal of advancing embodied intelligence. The problem addressed is the lack of large, diverse, and richly annotated human activity data, which hinders progress in learning physical interaction. To solve this, the authors created a one-million-hour video corpus that spans first-person and third-person perspectives, covering various activities, human-object interactions, and long-horizon behaviors in diverse environments. The dataset is annotated with interaction-centric information, including captions, motion descriptions, and hand and body-related signals.

The method involves a systematic data curation paradigm that treats human-centric filtering, temporal structuring, viewpoint diversity, and annotation enrichment as key design principles. This approach transforms unstructured internet video into a scalable substrate for representation learning, activity understanding, motion generation, and human-to-robot transfer.

The results indicate that HumanNet can be used to train vision-language-action models and that egocentric human video can stand in for robot data in continued training. In a controlled experiment, continued training on 1,000 hours of egocentric HumanNet video outperformed continued training on 100 hours of real-robot data. The authors take this as evidence that human-centric video can be a scalable, cost-effective substitute for robot data when scaling embodied foundation models.

📅 Published on May 7

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.06747
• PDF: https://arxiv.org/pdf/2605.06747
• Project Page: https://dagroup-pku.github.io/HumanNet/
• GitHub: https://github.com/DAGroup-PKU/HumanNet ⭐ 65

━━━━━━━━━━━━━━━━━━━━━━━━

#HumanCentricVideoLearning #EmbodiedIntelligence #LargeScaleVideoDatasets #HumanActivityRecognition #VideoUnderstanding

HumanNet: Scaling Human-centric Video Learning to One Million Hours

Progress in embodied intelligence increasingly depends on scalable data infrastructure. While vision and language have scaled with internet corpora, learning physical interaction remains...

🔥 MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven Dance Video Generation

💡 The paper presents MACE-Dance, a music-driven dance video generation framework that combines cascaded Mixture-of-Experts with diffusion models and specialized training strategies. The goal is to generate high-quality dance videos with realistic human motion and visual appearance, driven by music. Existing methods in related domains such as music-driven 3D dance generation and pose-driven image animation cannot be directly applied to this task, and current studies struggle to achieve both high-quality visual appearance and realistic human motion.

The MACE-Dance framework consists of two experts: the Motion Expert and the Appearance Expert. The Motion Expert generates 3D motion from music, ensuring kinematic plausibility and artistic expressiveness, using a diffusion model with a BiMamba-Transformer hybrid architecture and a Guidance-Free Training strategy. The Appearance Expert synthesizes motion- and reference-conditioned videos, preserving visual identity with spatiotemporal coherence, using a decoupled kinematic-aesthetic fine-tuning strategy.

To evaluate the framework, the authors curate a large-scale, diverse dataset and design a motion-appearance evaluation protocol. They report state-of-the-art performance in both 3D dance generation and pose-driven image animation, with MACE-Dance outperforming existing methods on motion realism and visual appearance. The code is publicly available.

📅 Published on May 7

🔗 Links:
• arXiv: https://arxiv.org/abs/2512.18181
• PDF: https://arxiv.org/pdf/2512.18181
• GitHub: https://github.com/AMAP-ML/MACE-Dance ⭐ 84

━━━━━━━━━━━━━━━━━━━━━━━━

#MusicDrivenVideoGeneration #DanceVideoSynthesis #MotionAppearanceModelling #CascadedMixtureOfExperts #MusicDrivenDanceGeneration

MACE-Dance: Motion-Appearance Cascaded Experts for Music-Driven...

With the rise of online dance-video platforms and rapid advances in AI-generated content (AIGC), music-driven dance generation has emerged as a compelling research direction. Despite substantial...

🔥 HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning for Parallel Multimodal Search Agents

💡 The paper introduces HyperEyes, a parallel multimodal search agent designed to optimize inference efficiency through dual-grained reinforcement learning. Existing multimodal search agents process target entities sequentially, which can lead to redundant interaction rounds and decreased efficiency. HyperEyes addresses this issue by fusing visual grounding and retrieval into a single atomic action, enabling concurrent search across multiple entities.
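
The difference between sequential and concurrent entity search can be illustrated with a small `asyncio` sketch. `search_entity` is a hypothetical stand-in for a fused grounding-and-retrieval tool call, and the round counting is a simplification. None of this is taken from the HyperEyes code.

```python
import asyncio


async def search_entity(entity: str) -> str:
    """Hypothetical tool call: ground one entity and retrieve a result for it."""
    await asyncio.sleep(0.01)  # stands in for tool latency
    return f"result for {entity}"


async def sequential_search(entities):
    """One tool call per round, so rounds grow with the number of entities."""
    results = [await search_entity(e) for e in entities]
    return results, len(entities)


async def parallel_search(entities):
    """All independent entities are searched concurrently in a single round."""
    results = await asyncio.gather(*(search_entity(e) for e in entities))
    return list(results), 1


entities = ["tower", "bridge", "statue"]
seq_results, seq_rounds = asyncio.run(sequential_search(entities))
par_results, par_rounds = asyncio.run(parallel_search(entities))
print(seq_rounds, par_rounds)
```

Both variants return the same results; only the number of interaction rounds differs, which is the quantity the efficiency comparison is about.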

The HyperEyes system is trained in two stages. The first stage involves a Parallel-Amenable Data Synthesis Pipeline that generates efficiency-oriented trajectories via Progressive Rejection Sampling. The second stage utilizes a Dual-Grained Efficiency-Aware Reinforcement Learning framework, which operates at two levels. At the macro level, the framework uses a trajectory-level reward called TRACE, which is designed to suppress superfluous tool calls without restricting genuine multi-hop search. At the micro level, the framework adapts On-Policy Distillation to inject dense token-level corrective signals from an external teacher on failed rollouts, mitigating the credit-assignment deficiency of sparse outcome rewards.

The paper also introduces a new benchmark, IMEB, which jointly evaluates search capability and efficiency. HyperEyes surpasses the strongest comparable open-source agent by 9.9% in accuracy while using 5.3x fewer tool-call rounds on average, which the authors attribute to dual-grained reinforcement learning.

📅 Published on May 8

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.07177
• PDF: https://arxiv.org/pdf/2605.07177
• GitHub: https://github.com/DeepExperience/HyperEyes ⭐ 34

━━━━━━━━━━━━━━━━━━━━━━━━

#MultimodalSearchAgents #ReinforcementLearning #ParallelSearchAlgorithms #EfficiencyAwareSystems #DualGrainedLearning

HyperEyes: Dual-Grained Efficiency-Aware Reinforcement Learning...

Existing multimodal search agents process target entities sequentially, issuing one tool call per entity and accumulating redundant interaction rounds whenever a query decomposes into independent...

🔥 What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned Autoencoders for Latent Diffusion

💡 This paper investigates which properties of a latent manifold are favorable for diffusion models. The authors argue that existing tokenizers, which define the latent space, are primarily designed to improve reconstruction fidelity or to inherit pre-trained representations, and do not necessarily produce a latent space well suited to generative modeling. They identify three key properties of a diffusion-friendly latent manifold: coherent spatial structure, local manifold continuity, and global manifold semantics. They find that these properties correlate more closely with downstream generation quality than reconstruction fidelity does.

To explicitly shape the latent manifold with these desirable properties, the authors propose a new method called the Prior-Aligned AutoEncoder, or PAE. The PAE uses refined priors derived from variational autoencoders and perturbation-based regularization to turn the desired properties of the latent manifold into explicit training objectives. This approach allows the PAE to directly optimize the latent space structure for improved generative modeling.

The authors evaluate the PAE on the ImageNet 256x256 dataset and find that it improves both training efficiency and generation quality compared to existing tokenizers. Specifically, the PAE achieves comparable performance to the state-of-the-art method, RAE, but with up to 13 times faster convergence under the same training setup. Additionally, the PAE achieves a new state-of-the-art result, with a generative fidelity score of 1.03. These results highlight the importance of organizing the latent manifold for latent diffusion models and demonstrate the effectiveness of the PAE in producing high-quality generative models.

📅 Published on May 8

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.07915
• PDF: https://arxiv.org/pdf/2605.07915
• Project Page: https://zhengrongyue.github.io/pae.github.io/
• GitHub: https://github.com/ZhengrongYue/PAE ⭐ 29

━━━━━━━━━━━━━━━━━━━━━━━━

#LatentDiffusionModels #GenerativeModeling #AutoencoderArchitecture #LatentManifoldLearning #DiffusionBasedGenerativeModels

What Matters for Diffusion-Friendly Latent Manifold? Prior-Aligned...

Tokenizers are a crucial component of latent diffusion models, as they define the latent space in which diffusion models operate. However, existing tokenizers are primarily designed to improve...

🔥 LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

💡 The paper proposes a new approach to improving large language models through test-time scaling, that is, allocating additional computation during inference. Existing test-time scaling strategies are typically hand-crafted, relying on manual design and tuning of reasoning patterns and heuristics. Hand-crafting leaves much of the computation-allocation space unexplored, so better strategies may go undiscovered.

To address this limitation, the authors introduce AutoTTS, an environment-driven framework that automates the discovery of test-time scaling strategies. Instead of designing individual strategies, researchers can create environments where optimal strategies can be discovered automatically. The key to AutoTTS lies in constructing a discovery environment that provides a tractable control space and frequent, low-cost feedback for strategy search.

The authors formulate test-time scaling as a controller synthesis problem over pre-collected reasoning trajectories and probe signals. In this framework, controllers decide when to branch, continue, probe, prune, or stop, and can be evaluated cheaply without requiring repeated calls to the language model. To make the search tractable, the authors introduce beta parameterization, which enables fine-grained execution trace feedback to improve discovery efficiency.
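
A minimal sketch of why evaluation over pre-collected trajectories is cheap: the controller is replayed against stored samples, so no model calls are needed. The data layout, the agreement-based stopping rule, and the restriction to two actions ("continue" and "stop") are assumptions made for illustration. The paper's controllers also branch, probe, and prune, and its search procedure is not reproduced here.

```python
from collections import Counter


def replay(controller, trajectories, max_samples):
    """Score a controller on stored samples; returns (accuracy, total token cost)."""
    correct, cost = 0, 0
    for traj in trajectories:
        seen = []
        for answer, tokens in traj["samples"][:max_samples]:
            seen.append(answer)
            cost += tokens
            if controller(seen) == "stop":
                break
        prediction = Counter(seen).most_common(1)[0][0]  # majority vote so far
        correct += prediction == traj["gold"]
    return correct / len(trajectories), cost


def agreement_controller(threshold):
    """Stop once the leading answer has been seen `threshold` times."""
    def decide(seen):
        top_count = Counter(seen).most_common(1)[0][1]
        return "stop" if top_count >= threshold else "continue"
    return decide


# Hypothetical pre-collected samples: (final answer, tokens spent) per reasoning path.
trajectories = [
    {"gold": "4", "samples": [("4", 10), ("4", 12), ("5", 9), ("4", 11)]},
    {"gold": "7", "samples": [("3", 8), ("7", 10), ("7", 9), ("7", 12)]},
]
for threshold in (1, 2):
    print(threshold, replay(agreement_controller(threshold), trajectories, max_samples=4))
```

Sweeping `threshold` traces out an accuracy-cost tradeoff from the same stored data, which is the kind of feedback a strategy search can consume at low cost.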

The approach is evaluated on mathematical reasoning benchmarks, where the discovered strategies show better accuracy-cost tradeoffs than strong manually designed baselines. The discovered strategies also generalize to held-out benchmarks and model scales. The entire discovery process costs $39.9 and takes 160 minutes.

Overall, the paper contributes a novel framework for automating test-time scaling strategy discovery, which has the potential to improve the performance of large language models while reducing the need for manual design and tuning. The authors also make their data and code available, facilitating further research and development in this area.

📅 Published on May 8

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.08083
• PDF: https://arxiv.org/pdf/2605.08083
• Project Page: https://zhengkid.github.io/AutoTTS-web/
• GitHub: https://github.com/zhengkid/AutoTTS ⭐ 43

━━━━━━━━━━━━━━━━━━━━━━━━

#LargeLanguageModels #TestTimeScaling #AgenticDiscovery #AutomatedReasoning #LanguageModelOptimization

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Test-time scaling (TTS) has become an effective approach for improving large language model performance by allocating additional computation during inference. However, existing TTS strategies are...

🔥 UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

💡 The paper introduces UniPrefill, a universal prefill acceleration framework designed to improve the inference efficiency of long-context processing in large language models. The problem addressed is that existing prefill acceleration methods are limited to specific model architectures and suffer performance degradation when applied to emerging architectures. Additionally, these methods are often incompatible with continuous batching, making it difficult to integrate them into modern inference engines.

The proposed UniPrefill framework overcomes these limitations by directly accelerating the model's computation at the token level, making it applicable to virtually any model architecture. UniPrefill is implemented as a continuous batching operator and is integrated into the vLLM inference engine, enabling seamless support for prefill-decode co-processing and tensor parallelism.
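
The summary does not specify how tokens are selected, so the following is only a generic illustration of block-wise token sparsification, not UniPrefill's algorithm. The per-token importance scores, the mean-pooling of scores per block, and the `block_size` and `keep_ratio` parameters are all assumptions.

```python
import numpy as np


def select_blocks(scores, block_size, keep_ratio):
    """Return sorted token indices belonging to the highest-scoring blocks."""
    n = len(scores)
    n_blocks = -(-n // block_size)  # ceiling division
    block_scores = np.array([
        scores[b * block_size:(b + 1) * block_size].mean() for b in range(n_blocks)
    ])
    n_keep = max(1, int(round(keep_ratio * n_blocks)))
    kept = np.sort(np.argsort(block_scores)[-n_keep:])  # keep original block order
    return np.concatenate([
        np.arange(b * block_size, min((b + 1) * block_size, n)) for b in kept
    ])


# Hypothetical per-token importance scores for an 8-token prompt.
scores = np.array([0.1, 0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.9])
print(select_blocks(scores, block_size=2, keep_ratio=0.5))
```

Selecting whole blocks rather than individual tokens keeps the retained tokens contiguous, which tends to map better onto batched GPU kernels. Whether UniPrefill makes the same choice for the same reason is not stated in the summary.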

The results show up to a 2.1x improvement in Time-To-First-Token, with the speedup becoming more pronounced as the number of concurrent requests grows.

📅 Published on May 7

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.06221
• PDF: https://arxiv.org/pdf/2605.06221
• GitHub: https://github.com/qhfan/UniPrefill ⭐ 22

━━━━━━━━━━━━━━━━━━━━━━━━

#LongContextProcessing #PrefillAcceleration #DynamicSparsification #LargeLanguageModels #BlockWiseOptimization

UniPrefill: Universal Long-Context Prefill Acceleration via...

As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. To improve the inference efficiency...

🔥 Pixal3D: Pixel-Aligned 3D Generation from Images

💡 The paper introduces Pixal3D, a new approach to generating 3D models from images that addresses the issue of fidelity, which refers to how accurately the generated 3D model represents the input image. Current 3D generative models often struggle with this due to the implicit correspondence between 2D images and 3D models. Pixal3D solves this problem by generating 3D models in a pixel-aligned way, meaning that each pixel in the input image is directly associated with a corresponding point in the 3D model.

To achieve this, the authors propose a pixel back-projection conditioning scheme that lifts image features into a 3D feature volume, establishing a direct correspondence between pixels and 3D points. This approach allows for high-fidelity 3D asset creation from images and can be scaled up to produce high-quality models. The method also extends to multi-view generation, where feature volumes from multiple views are aggregated to produce a more accurate 3D model.
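
Back-projection of image features into 3D can be sketched with a standard pinhole camera model: each 3D point is projected into the image and takes the feature found at that pixel. This is a generic illustration, not the Pixal3D implementation. Nearest-neighbour sampling, camera-space coordinates, and the function name are assumptions.

```python
import numpy as np


def back_project(features, K, points):
    """Sample image features at the pixel each 3D point projects to.

    features: (H, W, C) feature map; K: (3, 3) pinhole intrinsics;
    points: (N, 3) camera-space points, assumed to have z > 0.
    Returns (N, C); points that project outside the image get zeros.
    """
    H, W, C = features.shape
    uvw = points @ K.T                                # homogeneous pixel coordinates
    u = np.round(uvw[:, 0] / uvw[:, 2]).astype(int)   # column index
    v = np.round(uvw[:, 1] / uvw[:, 2]).astype(int)   # row index
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H) & (points[:, 2] > 0)
    out = np.zeros((len(points), C), dtype=features.dtype)
    out[inside] = features[v[inside], u[inside]]
    return out


# Toy 4x4 feature map with 2 channels and a hypothetical camera.
features = np.arange(4 * 4 * 2, dtype=float).reshape(4, 4, 2)
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
points = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, 5.0], [10.0, 0.0, 1.0]])
volume_features = back_project(features, K, points)
print(volume_features.shape)
```

Every point along the same camera ray receives the same pixel's feature, which is what makes the resulting 3D feature volume pixel-aligned. For multiple views, per-view volumes could be averaged or otherwise aggregated.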

The results show that Pixal3D substantially improves fidelity and approaches the level of reconstruction-based methods. Additionally, the authors demonstrate that pixel-aligned generation can benefit scene synthesis and propose a modular pipeline for producing high-fidelity, object-separated 3D scenes from images. Overall, Pixal3D provides a new approach to 3D generation that can produce high-fidelity models from single or multi-view images, and has the potential to inspire further research in this area.

📅 Published on May 11

🔗 Links:
• Project Page: https://huggingface.co/papers?q=back-projection%20conditioning
• arXiv: https://arxiv.org/abs/2605.10922
• PDF: https://arxiv.org/pdf/2605.10922
• GitHub: https://github.com/TencentARC/Pixal3D ⭐ 197

🤖 Models citing this paper:
• https://huggingface.co/TencentARC/Pixal3D

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/TencentARC/Pixal3D

━━━━━━━━━━━━━━━━━━━━━━━━

#3DModelGeneration #PixelAlignedRendering #ImageTo3D #3DGenerativeModels #DeepLearningForComputerVision

🔥 NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

💡 The paper introduces NanoResearch, a multi-agent framework designed to enhance research automation through personalized assistance. The problem addressed is that current research automation systems produce uniform outputs, which can under-serve individual users due to differences in resource configurations, methodological preferences, and target output formats. To achieve personalization, three capabilities are required: accumulating reusable procedural knowledge, retaining user-specific experience, and internalizing implicit preferences.

The proposed method, NanoResearch, addresses these gaps through a tri-level co-evolution approach. It consists of three components: a skill bank that distills recurring operations into reusable procedural rules, a memory module that maintains user- and project-specific experience, and a label-free policy learning module that converts free-form feedback into persistent parameter updates. These components co-evolve, with reliable skills producing richer memory, richer memory informing better planning, and preference internalization continuously realigning the loop to each user.

The results of extensive experiments demonstrate that NanoResearch delivers substantial gains over state-of-the-art AI research systems. It progressively refines itself to produce better research at lower cost over successive cycles, making it a more effective and efficient solution for research automation. Overall, the paper contributes a novel framework for personalized research automation, addressing the limitations of current systems and providing a more tailored approach to research assistance.

📅 Published on May 11

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.10813
• PDF: https://arxiv.org/pdf/2605.10813
• GitHub: https://github.com/OpenRaiser/NanoResearch ⭐ 940

━━━━━━━━━━━━━━━━━━━━━━━━

#ResearchAutomation #PersonalizedAssistance #MultiAgentFramework #ProceduralKnowledge #AutomatedResearchSystems

NanoResearch: Co-Evolving Skills, Memory, and Policy for...

LLM-powered multi-agent systems can now automate the full research pipeline from ideation to paper writing, but a fundamental question remains: automation for whom? Researchers operate under...
