AI & ML Papers

UniPrefill: Universal Long-Context Prefill Acceleration via...

As large language models (LLMs) continue to advance rapidly, they are becoming increasingly capable while simultaneously demanding ever-longer context lengths. To improve the inference efficiency...

🔥 Pixal3D: Pixel-Aligned 3D Generation from Images

💡 The paper introduces Pixal3D, a new approach to generating 3D models from images that addresses the issue of fidelity, which refers to how accurately the generated 3D model represents the input image. Current 3D generative models often struggle with this due to the implicit correspondence between 2D images and 3D models. Pixal3D solves this problem by generating 3D models in a pixel-aligned way, meaning that each pixel in the input image is directly associated with a corresponding point in the 3D model.

To achieve this, the authors propose a pixel back-projection conditioning scheme that lifts image features into a 3D feature volume, establishing a direct correspondence between pixels and 3D points. This approach allows for high-fidelity 3D asset creation from images and can be scaled up to produce high-quality models. The method also extends to multi-view generation, where feature volumes from multiple views are aggregated to produce a more accurate 3D model.
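The back-projection step described above can be illustrated with a small sketch. It is not the authors' implementation: it assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, nearest-pixel sampling, and plain averaging for the multi-view case. The function names `back_project` and `aggregate_views` are invented for illustration.

```python
def back_project(feature_map, voxel_centers, fx, fy, cx, cy):
    """Give each 3D voxel centre the feature of the pixel it projects onto.

    feature_map: 2D grid [row][col] of feature vectors (lists of floats).
    voxel_centers: list of (x, y, z) points in camera coordinates.
    Points behind the camera or outside the image receive a zero feature.
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    volume = []
    for x, y, z in voxel_centers:
        feature = [0.0] * channels
        if z > 0:
            # Pinhole projection of the 3D point onto the image plane.
            u = int(round(fx * x / z + cx))
            v = int(round(fy * y / z + cy))
            if 0 <= u < width and 0 <= v < height:
                feature = list(feature_map[v][u])
        volume.append(feature)
    return volume


def aggregate_views(volumes):
    """Average per-view feature volumes (assumed aggregation rule)."""
    count = len(volumes)
    return [
        [sum(channel) / count for channel in zip(*per_voxel)]
        for per_voxel in zip(*volumes)
    ]
```

Because every voxel reads its feature from exactly one pixel, the correspondence between image and 3D volume is explicit rather than learned implicitly.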

The results show that Pixal3D substantially improves fidelity and approaches the level of reconstruction-based methods. Additionally, the authors demonstrate that pixel-aligned generation can benefit scene synthesis and propose a modular pipeline for producing high-fidelity, object-separated 3D scenes from images. Overall, Pixal3D provides a new approach to 3D generation that can produce high-fidelity models from single or multi-view images, and has the potential to inspire further research in this area.

📅 Published on May 11

🔗 Links:
• Project Page: https://huggingface.co/papers?q=back-projection%20conditioning
• arXiv: https://arxiv.org/abs/2605.10922
• PDF: https://arxiv.org/pdf/2605.10922
• GitHub: https://github.com/TencentARC/Pixal3D ⭐ 197

🤖 Models citing this paper:
• https://huggingface.co/TencentARC/Pixal3D

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/TencentARC/Pixal3D

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#3DModelGeneration #PixelAlignedRendering #ImageTo3D #3DGenerativeModels #DeepLearningForComputerVision

🔥 NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

💡 The paper introduces NanoResearch, a multi-agent framework designed to enhance research automation through personalized assistance. The problem addressed is that current research automation systems produce uniform outputs, which can under-serve individual users due to differences in resource configurations, methodological preferences, and target output formats. To achieve personalization, three capabilities are required: accumulating reusable procedural knowledge, retaining user-specific experience, and internalizing implicit preferences.

The proposed method, NanoResearch, addresses these gaps through a tri-level co-evolution approach. It consists of three components: a skill bank that distills recurring operations into reusable procedural rules, a memory module that maintains user- and project-specific experience, and a label-free policy learning module that converts free-form feedback into persistent parameter updates. These components co-evolve, with reliable skills producing richer memory, richer memory informing better planning, and preference internalization continuously realigning the loop to each user.

The results of extensive experiments demonstrate that NanoResearch delivers substantial gains over state-of-the-art AI research systems. It progressively refines itself to produce better research at lower cost over successive cycles, making it a more effective and efficient solution for research automation. Overall, the paper contributes a novel framework for personalized research automation, addressing the limitations of current systems and providing a more tailored approach to research assistance.

📅 Published on May 11

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.10813
• PDF: https://arxiv.org/pdf/2605.10813
• GitHub: https://github.com/OpenRaiser/NanoResearch ⭐ 940

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ResearchAutomation #PersonalizedAssistance #MultiAgentFramework #ProceduralKnowledge #AutomatedResearchSystems

🔥 MiniCPM-V 4.5: Cooking Efficient MLLMs via Architecture, Data, and Training Recipe

💡 The paper introduces MiniCPM-V 4.5, a highly efficient 8 billion parameter multimodal large language model that achieves strong performance. The development of multimodal large language models is rapidly advancing, but their training and inference efficiency has become a major obstacle to making them more accessible and scalable. To address this challenge, the authors propose three key improvements: a unified 3D-Resampler architecture for compact encoding of images and videos, a unified learning paradigm for document knowledge and text recognition without requiring extensive data engineering, and a hybrid reinforcement learning strategy for proficiency in both short and long reasoning modes.

The unified 3D-Resampler architecture enables highly compact encoding of visual data, while the unified learning paradigm simplifies the learning process by eliminating the need for heavy data engineering. The hybrid reinforcement learning strategy allows the model to excel in both short and long reasoning modes, making it a versatile and efficient model.
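As a rough illustration of compact visual encoding, the sketch below assumes the resampler follows the common learnable-query cross-attention pattern, in which a small fixed set of queries attends over all patch tokens from one or more frames. The summary does not spell out the actual design, so `flatten_frames` and `resample` are hypothetical names, and the learned projections of a real model are omitted.

```python
import math


def flatten_frames(frames):
    """Concatenate the patch tokens of several frames into one sequence."""
    return [token for frame in frames for token in frame]


def resample(queries, tokens):
    """Single-head cross-attention: each query pools over all tokens.

    The output has one vector per query, so the token count after
    resampling is fixed regardless of how many frames were supplied.
    """
    dim = len(queries[0])
    outputs = []
    for query in queries:
        scores = [
            sum(q * t for q, t in zip(query, token)) / math.sqrt(dim)
            for token in tokens
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(score - peak) for score in scores]
        total = sum(weights)
        outputs.append([
            sum(w * token[d] for w, token in zip(weights, tokens)) / total
            for d in range(dim)
        ])
    return outputs
```

With two queries, three frames of two patches each are reduced from six tokens to two, which is the sense in which the encoding is compact.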

The authors evaluated MiniCPM-V 4.5 with the OpenCompass evaluation framework and found that it outperforms widely used proprietary models such as GPT-4o-latest and significantly larger open-source models such as Qwen2.5-VL 72B. Notably, MiniCPM-V 4.5 achieves state-of-the-art performance on the VideoMME benchmark among models under 30 billion parameters while using only 46.7 percent of the GPU memory and 8.7 percent of the inference time of Qwen2.5-VL 7B.

🔥 RoboMemArena: A Comprehensive and Challenging Robotic Memory Benchmark

💡 The paper introduces RoboMemArena, a comprehensive robotic memory benchmark that addresses the limitations of existing benchmarks by providing a large-scale and diverse set of tasks with real-world evaluation. The benchmark consists of 26 tasks with average trajectory lengths of over 1000 steps per task, and 68.9 percent of subtasks require memory dependence. The tasks are generated using a vision-language model that designs and composes subtasks, generates full trajectories, and provides memory-related annotations.

To tackle the challenges of the RoboMemArena benchmark, the authors propose PrediMem, a dual-system vision-language architecture that improves memory management through predictive coding. PrediMem consists of a high-level vision-language model planner that manages a memory bank with recent and keyframe buffers, and uses a predictive coding head to enhance sensitivity to task dynamics.
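A memory bank with recent and keyframe buffers can be sketched as below. The selection rule is an assumption: the summary does not say how keyframes are chosen, so this sketch promotes a step to a keyframe when a hypothetical prediction-error signal exceeds a threshold. The class and parameter names are illustrative, not taken from the paper.

```python
from collections import deque


class MemoryBank:
    """Two bounded buffers: the latest steps plus selected keyframes."""

    def __init__(self, recent_size, keyframe_size, threshold):
        self.recent = deque(maxlen=recent_size)      # sliding window
        self.keyframes = deque(maxlen=keyframe_size)  # sparse long-term record
        self.threshold = threshold

    def add(self, step, observation, prediction_error):
        """Store every step as recent; keep surprising steps as keyframes."""
        self.recent.append((step, observation))
        if prediction_error > self.threshold:  # assumed selection rule
            self.keyframes.append((step, observation))

    def context(self):
        """Return stored observations in time order, without duplicates."""
        merged = dict(self.keyframes)
        merged.update(dict(self.recent))
        return [merged[step] for step in sorted(merged)]
```

Both buffers are bounded, so the planner's context stays a fixed size even when a trajectory runs for more than 1000 steps.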

The authors evaluate PrediMem on the RoboMemArena benchmark and demonstrate that it outperforms all baseline models. The results provide insights into memory management, model architecture, and scaling laws for complex memory systems. The paper contributes to the development of robotic intelligence by providing a comprehensive benchmark and a state-of-the-art model that can effectively manage memory in partially observable environments.

The key contributions are the RoboMemArena benchmark, a challenging and diverse task set for evaluating robotic memory, and the PrediMem model, which improves memory management through predictive coding. Together they provide a foundation for future research on robotic memory.

📅 Published on May 11

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.10921
• PDF: https://arxiv.org/pdf/2605.10921
• Project Page: https://robomemarena.github.io/
• GitHub: https://github.com/OpenHelix-Team/RoboMemArena ⭐ 43

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#RoboticMemoryBenchmark #VisionLanguageModel #RoboticsAndMemory #ArtificialIntelligenceBenchmarking #RoboMemArena

🔥 CapVector: Learning Transferable Capability Vectors in Parametric Space for Vision-Language-Action Models

💡 This paper proposes a novel approach called CapVector to improve the performance of vision-language-action models. The problem addressed is that pre-trained models often fail to improve performance and reduce adaptation costs during standard supervised finetuning. Advanced finetuning methods with auxiliary training objectives can improve performance but incur significant computational overhead.

The proposed method decouples the auxiliary training objectives from standard supervised finetuning to enhance model capabilities while reducing computational overhead. The model is trained to convergence on a small-scale task set with two distinct training strategies, yielding two finetuned models. The parameter difference between the two models is interpreted as the capability vector contributed by the auxiliary objectives. These vectors are then merged with the pre-trained parameters to form a capability-enhanced meta model.
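The parameter arithmetic described above can be sketched in a few lines. This is a simplified illustration, not the released code: parameters are plain dictionaries of float lists, and the `scale` factor is an assumed knob that the summary does not mention.

```python
def capability_vector(aux_params, sft_params):
    """Difference between the auxiliary-objective and plain-SFT checkpoints."""
    return {
        name: [a - s for a, s in zip(aux_params[name], sft_params[name])]
        for name in aux_params
    }


def merge(pretrained, vector, scale=1.0):
    """Add the capability vector to the pre-trained weights (meta model)."""
    return {
        name: [p + scale * v for p, v in zip(pretrained[name], vector[name])]
        for name in pretrained
    }


# Toy usage with one two-element weight tensor.
vector = capability_vector({"w": [1.5, 2.0]}, {"w": [1.0, 2.5]})
meta_model = merge({"w": [0.0, 1.0]}, vector)
```

The vector is computed once on the small task set; downstream finetuning then starts from the merged weights and no longer needs the auxiliary objectives.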

The method also uses a lightweight orthogonal regularization loss to augment standard supervised finetuning, which reduces computational overhead. The results show that the capability vectors are effective and versatile across diverse models, and can generalize to novel environments and embodiments without additional training. The proposed approach achieves performance comparable to auxiliary finetuned baselines with reduced computational overhead, making it a promising solution for improving vision-language-action models.

📅 Published on May 11

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.10903
• PDF: https://arxiv.org/pdf/2605.10903
• Project Page: https://capvector.github.io/
• GitHub: https://github.com/OpenHelix-Team/CapVector ⭐ 26

🤖 Models citing this paper:
• https://huggingface.co/haofuly/capvector_models_collection

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionLanguageModels #ParametricSpaceLearning #TransferableCapabilities #VisionLanguageAction #MultimodalLearning

🔥 Mutual Reasoning Makes Smaller LLMs Stronger Problem-Solvers

💡 This paper introduces rStar, an approach that improves the reasoning capabilities of small language models without fine-tuning or help from a stronger model. The problem addressed is that small language models often struggle with complex reasoning tasks.

The rStar method is a self-play mutual generation-discrimination process: one small language model generates reasoning trajectories using Monte Carlo Tree Search over a set of human-like reasoning actions, and a second model of similar capability acts as a discriminator that verifies those trajectories. Trajectories on which the two models agree are considered more likely to be correct.

The results show that rStar can solve diverse reasoning problems, including math and strategy-based tasks, and significantly improves the accuracy of small language models. For example, it boosts GSM8K accuracy from 12.51 percent to 63.91 percent for LLaMA2-7B and from 36.46 percent to 81.88 percent for Mistral-7B.
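The mutual-agreement check can be sketched as a simple filter. This is a simplified reading of the idea, not the paper's algorithm: the discriminator is a hypothetical callable `complete` that receives the first part of a trajectory and returns its own final answer, and the tree search that produces the candidates is omitted.

```python
def mutually_consistent(trajectories, complete, split=0.5):
    """Keep trajectories whose answer the discriminator reproduces.

    trajectories: list of (steps, answer) pairs from the generator.
    complete: callable taking a list of prefix steps, returning an answer.
    split: fraction of steps revealed to the discriminator (assumed value).
    """
    agreed = []
    for steps, answer in trajectories:
        cut = max(1, int(len(steps) * split))
        # Agreement means the second model, continuing from the prefix,
        # arrives at the same final answer as the generator.
        if complete(steps[:cut]) == answer:
            agreed.append((steps, answer))
    return agreed


def toy_discriminator(prefix):
    """Stand-in for a second language model, for demonstration only."""
    return "10" if prefix[0] == "2+3=5" else "11"


candidates = [
    (["2+3=5", "5*2=10"], "10"),
    (["2+3=6", "6*2=12"], "12"),
]
kept = mutually_consistent(candidates, toy_discriminator)
```

Only the first candidate survives, because the stand-in discriminator reaches the same answer from its prefix.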

📅 Published on Aug 12, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2408.06195
• PDF: https://arxiv.org/pdf/2408.06195
• GitHub: https://github.com/codelion/optillm ⭐ 3.7k

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/algorithmicsuperintelligence/OptiLLM
• https://huggingface.co/spaces/fabiodr/optillm
• https://huggingface.co/spaces/EduuGomes/CachoeiraBot

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MutualReasoning #LLMProblemSolving #MonteCarloTreeSearch #SelfPlayLearning #LanguageModelOptimization

🔥 MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction

💡 The paper introduces MiniCPM-o 4.5, a model for real-time full-duplex omni-modal interaction: it can see, listen, and speak simultaneously. Current multimodal large language models separate perception and response into distinct phases and behave reactively, which prevents them from incorporating new inputs for timely adjustments during generation.

To address these issues, the authors propose Omni-Flow, a unified streaming framework that aligns omni-modal inputs and outputs along a shared temporal axis, converting conventional turn-based interaction into a full-duplex, time-aligned process. This enables simultaneous perception and response and allows proactive behavior to arise within the same framework.

MiniCPM-o 4.5 has 9B parameters and achieves state-of-the-art open-source performance: it approaches the performance of Gemini 2.5 Flash and surpasses Qwen3-Omni-30B-A3B in omni-modal understanding and speech generation while delivering better computation efficiency. The model can perform real-time full-duplex omni-modal interaction on edge devices with less than 12GB of RAM.
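The shared temporal axis behind Omni-Flow can be pictured as merging time-stamped chunks from every input and output stream into one ordered sequence. The sketch below shows only that ordering step under assumed conventions: each chunk is a hypothetical `(timestamp_ms, modality, payload)` tuple, and each stream is already sorted by time.

```python
import heapq


def interleave(*streams):
    """Merge time-sorted streams into one sequence ordered by timestamp."""
    return list(heapq.merge(*streams, key=lambda chunk: chunk[0]))


video_in = [(0, "video", "frame0"), (40, "video", "frame1")]
audio_in = [(10, "audio_in", "chunk0")]
speech_out = [(20, "speech_out", "chunk0")]

timeline = interleave(video_in, audio_in, speech_out)
```

In the resulting timeline an output chunk sits between two input chunks, which is what distinguishes a full-duplex stream from a turn-based exchange.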

📅 Published on Apr 30

🔗 Links:
• arXiv: https://arxiv.org/abs/2604.27393
• PDF: https://arxiv.org/pdf/2604.27393
• Project Page: https://huggingface.co/openbmb/MiniCPM-o-4_5
• GitHub: https://github.com/OpenBMB/MiniCPM-o ⭐ 24.7k

🤖 Models citing this paper:
• https://huggingface.co/openbmb/MiniCPM-o-4_5
• https://huggingface.co/openbmb/MiniCPM-V-4.6
• https://huggingface.co/openbmb/MiniCPM-V-4.6-Thinking

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/openbmb/MiniCPM-V-4.6-Demo
• https://huggingface.co/spaces/usermma/treadon-MiniCPM-V-4.6-Abliterated-AND-Disinhibited-Q4_K_M-gguf
• https://huggingface.co/spaces/lspatilvs/Medical-Report-OCR

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalInteraction #FullDuplexCommunication #OmniModalProcessing #RealTimeLanguageModels #MultimodalLargeLanguageModels

🔥 SenseNova-U1: Unifying Multimodal Understanding and Generation with NEO-unify Architecture

💡 The paper introduces SenseNova-U1, a unified multimodal model that integrates understanding and generation into a single process, overcoming the traditional divide between these two tasks. Current large vision-language models treat understanding and generation as separate problems, leading to fragmented architectures and misaligned representation spaces. The authors argue that this divide hinders the emergence of native multimodal intelligence and propose a new paradigm, NEO-unify, which views understanding and generation as synergistic aspects of a single process.

The authors present two variants of SenseNova-U1, built on dense and mixture-of-experts understanding baselines, and demonstrate their performance across various tasks, including text understanding, vision-language perception, knowledge reasoning, agentic decision-making, and spatial intelligence. The models also excel in image synthesis, infographic generation, and interleaved vision-language generation, showing strong semantic consistency and visual fidelity.

The paper provides detailed information on model design, data preprocessing, pre- and post-training, and inference strategies, supporting community research. The results show that SenseNova-U1 models perform strongly in vision-language-action and world model scenarios, indicating a broader roadmap where models can think and act across modalities in a native manner. The authors conclude that multimodal AI should focus on building a unified system, rather than connecting separate systems, allowing necessary capabilities to emerge from within. Overall, the paper contributes to the development of unified multimodal models that can integrate understanding and generation, paving the way for more advanced and native multimodal intelligence.

📅 Published on May 12

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.12500
• PDF: https://arxiv.org/pdf/2605.12500
• GitHub: https://github.com/OpenSenseNova/SenseNova-U1 ⭐ 1.6k

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalUnderstanding #NEOunifyArchitecture #VisionLanguageModels #MultimodalGeneration #UnifiedIntelligenceModels

🔥 δ-mem: Efficient Online Memory for Large Language Models

💡 The paper proposes a lightweight memory mechanism called delta-mem to enhance large language models by providing a compact online state of associative memory. The problem addressed is the need for large language models to accumulate and reuse historical information in long-term assistants and agent systems, which is challenging due to the high cost of expanding the context window and ineffective context utilization.

The proposed method, delta-mem, augments a frozen full-attention backbone with a compact online state that compresses past information into a fixed-size state matrix updated by delta-rule learning. This online state is used to generate low-rank corrections to the backbone's attention computation during generation, allowing for efficient online memory.
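The delta-rule update of a fixed-size state matrix can be written compactly. The sketch below uses the standard delta rule, S ← S + β (v − S k) kᵀ, in plain Python. Applying it this way is an assumption about the general technique rather than the paper's exact formulation, and the low-rank correction to attention is not shown.

```python
def delta_update(state, key, value, beta=1.0):
    """One delta-rule write: move the state's prediction for key toward value."""
    predicted = [sum(s * k for s, k in zip(row, key)) for row in state]
    return [
        [s + beta * (v - p) * k for s, k in zip(row, key)]
        for row, v, p in zip(state, value, predicted)
    ]


def read(state, query):
    """Retrieve the value associated with a query (matrix-vector product)."""
    return [sum(s * q for s, q in zip(row, query)) for row in state]


# Toy usage: a 2x2 state, one unit-norm key written twice.
state = [[0.0, 0.0], [0.0, 0.0]]
state = delta_update(state, key=[1.0, 0.0], value=[3.0, 4.0])
first = read(state, [1.0, 0.0])
state = delta_update(state, key=[1.0, 0.0], value=[1.0, 2.0])
second = read(state, [1.0, 0.0])
```

With a unit-norm key and `beta=1.0`, the second write replaces the stored value instead of adding to it, so the state can be revised without growing in size.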

The results show that delta-mem improves the average score of the frozen backbone and achieves larger gains on memory-heavy benchmarks, such as MemoryAgentBench and LoCoMo, while preserving general capabilities. Notably, delta-mem achieves these results with only an 8x8 online memory state, demonstrating that effective memory can be realized through a compact online state directly coupled with attention computation, without requiring full fine-tuning, backbone replacement, or explicit context extension. Overall, the paper contributes a novel and efficient approach to enhancing large language models with online memory, which has the potential to improve performance in a range of applications.

📅 Published on May 12

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.12357
• PDF: https://arxiv.org/pdf/2605.12357
• GitHub: https://github.com/declare-lab/delta-Mem ⭐ 46

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LargeLanguageModels #AssociativeMemoryMechanisms #EfficientOnlineLearning #DeltaRuleLearning #CompactStateRepresentations
