AI & ML Papers

UniVidX: A Unified Multimodal Framework for Versatile Video...

Recent progress has shown that video diffusion models (VDMs) can be repurposed for diverse multimodal graphics tasks. However, existing methods often train separate models for each problem...


🔥 MolmoAct2: Action Reasoning Models for Real-world Deployment

💡 The paper presents MolmoAct2, an open action reasoning model for robotics. Current vision-language-action models aim to provide a single generalist controller for robots, but they are often closed, require expensive hardware, or have high latency. MolmoAct2 addresses these issues with several new components, including a specialized vision-language-model backbone called MolmoER, which is trained on a large corpus of data and designed for spatial and embodied reasoning. The release also includes three new datasets, among them the largest open bimanual dataset to date, and an open-weight action tokenizer called OpenFAST. The architecture has been redesigned to include a continuous-action expert and an adaptive-depth reasoning variant called MolmoThink, which reduces latency by re-predicting depth tokens only for scene regions that change between timesteps. MolmoAct2 outperforms strong baselines on several simulation and real-world benchmarks, and the model weights, training code, and training data are released for use by others. Overall, MolmoAct2 is a fully open action reasoning model that is designed for practical deployment and advances the state of the art in robotics.
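
The idea of re-predicting depth tokens only for regions that change can be illustrated with a small caching sketch. This is not MolmoAct2's implementation: the function names, the one-number-per-region representation, the tolerance, and the toy predictor are all assumptions made for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def refresh_depth_tokens(
    prev_regions: Sequence[float],
    curr_regions: Sequence[float],
    prev_tokens: Sequence[int],
    predict_token: Callable[[float], int],
    tolerance: float = 0.05,
) -> Tuple[List[int], int]:
    """Reuse cached depth tokens for regions that did not change.

    Each region is reduced to a single number purely for illustration.
    Returns the new token list and the number of re-predicted tokens.
    """
    tokens: List[int] = []
    recomputed = 0
    for prev, curr, cached in zip(prev_regions, curr_regions, prev_tokens):
        if abs(curr - prev) > tolerance:
            tokens.append(predict_token(curr))  # region changed: re-predict
            recomputed += 1
        else:
            tokens.append(cached)  # region is static: keep the cached token
    return tokens, recomputed


# Hypothetical predictor: quantize a value in [0, 1] into one of 8 bins.
def toy_predict(value: float) -> int:
    return min(int(value * 8), 7)


prev = [0.125, 0.5, 0.875]
curr = [0.125, 0.75, 0.875]  # only the middle region changed
cached = [toy_predict(v) for v in prev]
tokens, recomputed = refresh_depth_tokens(prev, curr, cached, toy_predict)
print(tokens, recomputed)
```

The latency saving in this sketch comes from calling the predictor once instead of three times; how the real model detects change and merges tokens is not stated in the summary.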

MolmoAct2: Action Reasoning Models for Real-world Deployment

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models...


🔥 HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

💡 The paper introduces HeavySkill, a framework that internalizes complex reasoning as a skill within a model's parameters, rather than relying on external orchestration. The problem with current approaches is that they use intricate system designs that obscure the underlying mechanism driving performance. HeavySkill proposes a two-stage pipeline consisting of parallel reasoning and summarization, which can operate beneath any agentic harness. The method involves identifying heavy thinking as an inner skill that can be learned and scaled via reinforcement learning. The authors conducted a systematic empirical study of HeavySkill across diverse domains and found that it consistently outperforms traditional Best-of-N strategies. The results show that stronger language models can even approach Pass@N performance, and that the depth and width of heavy thinking can be further scaled via reinforcement learning. This offers a promising path toward self-evolving language models that internalize complex reasoning without relying on brittle orchestration layers. Overall, the paper contributes a new perspective on complex reasoning, demonstrating that internalizing heavy thinking as a skill can lead to superior performance and more robust models.
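
The two-stage pipeline described above (parallel reasoning, then summarization) can be sketched next to a Best-of-N baseline. This is a minimal sketch under stated assumptions: `heavy_think`, `best_of_n`, and the toy stand-ins are hypothetical names, and a majority vote replaces the summarization step, which in the paper is presumably performed by the language model itself.

```python
from collections import Counter
from typing import Callable, List


def heavy_think(
    question: str,
    reason: Callable[[str, int], str],
    summarize: Callable[[str, List[str]], str],
    width: int = 4,
) -> str:
    """Stage 1: sample `width` independent traces. Stage 2: summarize them."""
    traces = [reason(question, seed) for seed in range(width)]
    return summarize(question, traces)


def best_of_n(
    question: str,
    reason: Callable[[str, int], str],
    score: Callable[[str], float],
    n: int = 4,
) -> str:
    """Baseline: sample n candidates and keep the one the scorer likes best."""
    candidates = [reason(question, seed) for seed in range(n)]
    return max(candidates, key=score)


# Toy stand-ins (assumptions): the first sample is wrong, the rest agree.
def toy_reason(question: str, seed: int) -> str:
    return "41" if seed == 0 else "42"


def toy_summarize(question: str, traces: List[str]) -> str:
    # Majority vote as a placeholder for a learned summarization pass.
    return Counter(traces).most_common(1)[0][0]


answer = heavy_think("What is 6 * 7?", toy_reason, toy_summarize, width=4)
print(answer)
```

The structural difference is that Best-of-N can only return one of its candidates, while the summarization stage sees all traces at once and writes a new answer.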

📅 Published on May 4

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.02396
• PDF: https://arxiv.org/pdf/2605.02396
• Project Page: https://github.com/wjn1996/HeavySkill
• GitHub: https://github.com/wjn1996/HeavySkill ⭐ 40

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticHarness #HeavyThinking #ReinforcementLearning #ComplexReasoning #InnerSkillMechanisms

HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

Recent advances in agentic harness with orchestration frameworks that coordinate multiple agents with memory, skills, and tool use have achieved remarkable success in complex reasoning tasks....


🔥 PyTorch Distributed: Experiences on Accelerating Data Parallel Training

💡 The paper discusses the design and implementation of the PyTorch distributed data parallel module, which aims to optimize large-scale model training by scaling out to multiple computational resources. The need for this arises from the increasing demand for large datasets and models in deep learning research and applications. Data parallelism is a popular solution for distributed training, where the model is replicated on each resource to generate gradients independently, and then these gradients are communicated at each iteration to keep the model replicas consistent.

However, optimizing the distributed training efficiency is non-trivial due to the subtle dependencies between computation and communication. To address this, the PyTorch distributed data parallel module provides several techniques to accelerate distributed training, including gradient bucketing, computation-communication overlap, and selective synchronization.

The paper evaluates these techniques and shows that, when configured appropriately, the PyTorch distributed data parallel module achieves near-linear scalability using up to 256 GPUs. In other words, training throughput grows almost in proportion to the number of GPUs, so the time needed for a fixed training workload shrinks accordingly.

Overall, the paper contributes to the development of efficient distributed training methods, which is essential for the advancement of deep learning research and applications. The PyTorch distributed data parallel module provides a scalable and efficient solution for training large models, and its evaluation demonstrates the potential for significant speedups in training times.
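
Gradient bucketing, one of the techniques named above, can be illustrated without any GPU. The sketch below is a pure-Python simulation, not PyTorch code: `build_buckets`, `all_reduce_mean`, and the parameter names are hypothetical, and plain lists stand in for tensors. Buckets are filled in reverse registration order on the assumption that gradients of later layers become ready first during the backward pass.

```python
from typing import Dict, List


def build_buckets(param_sizes: Dict[str, int], cap: int) -> List[List[str]]:
    """Group parameters into buckets holding at most `cap` elements.

    A parameter larger than `cap` gets a bucket of its own.
    """
    buckets: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name in reversed(list(param_sizes)):
        size = param_sizes[name]
        if current and used + size > cap:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets


def all_reduce_mean(replica_grads: List[Dict[str, List[float]]],
                    bucket: List[str]) -> None:
    """Average one bucket's gradients across all replicas, in place."""
    world_size = len(replica_grads)
    for name in bucket:
        length = len(replica_grads[0][name])
        mean = [sum(r[name][i] for r in replica_grads) / world_size
                for i in range(length)]
        for r in replica_grads:
            r[name] = list(mean)


sizes = {"layer1.weight": 6, "layer1.bias": 2,
         "layer2.weight": 4, "layer2.bias": 1}
buckets = build_buckets(sizes, cap=5)
print(buckets)

# Two replicas with different local gradients for a single parameter.
grads = [{"w": [1.0, 3.0]}, {"w": [3.0, 5.0]}]
all_reduce_mean(grads, ["w"])
print(grads)
```

One collective call per bucket, rather than one per parameter, is what reduces communication overhead, and launching a bucket's reduction as soon as it is full is what allows overlap with the rest of the backward pass. In PyTorch itself the corresponding knobs appear to be the `bucket_cap_mb` argument of `torch.nn.parallel.DistributedDataParallel` and its `no_sync()` context manager for skipping synchronization.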

📅 Published on Jun 28, 2020

🔗 Links:
• arXiv: https://arxiv.org/abs/2006.15704
• PDF: https://arxiv.org/pdf/2006.15704
• GitHub: https://github.com/pytorch/pytorch ⭐ 99.7k

#PyTorchDistributed #DataParallelTraining #DistributedDeepLearning #LargeScaleModelTraining #AcceleratedMachineLearning

PyTorch Distributed: Experiences on Accelerating Data Parallel Training

This paper presents the design, implementation, and evaluation of the PyTorch distributed data parallel module. PyTorch is a widely-adopted scientific computing package used in deep learning...


🔥 Continuous Audio Language Models

💡 The paper introduces Continuous Audio Language Models, a new approach to audio generation that addresses the limitations of traditional discrete audio language models. Discrete models represent audio as sequences of discrete tokens, which are extracted from lossy codecs with limited bitrate, resulting in a trade-off between audio quality and computational cost. To overcome this issue, the authors propose Continuous Audio Language Models, which instantiate a large Transformer backbone that produces a contextual embedding at every time step. This sequential information then conditions a multilayer perceptron to generate the next continuous frame of an audio Variational Autoencoder through consistency modeling. By avoiding lossy compression, Continuous Audio Language Models achieve higher quality at lower computational cost than their discrete counterparts. Experiments on speech and music demonstrate improved efficiency and fidelity over state-of-the-art discrete audio language models, facilitating lightweight, high-quality audio generation. The approach enables the generation of high-quality audio samples, which are made available for demonstration purposes. Overall, the paper contributes a novel method for continuous audio language modeling, which has the potential to improve the efficiency and quality of audio generation tasks.

📅 Published on Sep 8, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2509.06926
• PDF: https://arxiv.org/pdf/2509.06926
• Project Page: https://huggingface.co/spaces/kyutai/calm-samples
• GitHub: https://github.com/kyutai-labs/pocket-tts ⭐ 4.2k

🤖 Models citing this paper:
• https://huggingface.co/kyutai/pocket-tts
• https://huggingface.co/kyutai/pocket-tts-without-voice-cloning
• https://huggingface.co/Verylicious/pocket-tts-ungated

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/D3vShoaib/pocket-tts
• https://huggingface.co/spaces/kyutai/calm-samples
• https://huggingface.co/spaces/Xlnk/tts

#AudioLanguageModels #ContinuousAudioGeneration #TransformerBackbone #AudioVariationalAutoencoders #MultilayerPerceptron

Continuous Audio Language Models

Audio Language Models (ALM) have emerged as the dominant paradigm for speech and music generation by representing audio as sequences of discrete tokens. Yet, unlike text tokens, which are...


🔥 PDFMathTranslate: Scientific Document Translation Preserving Layouts

💡 The paper introduces PDFMathTranslate, a software that enables the translation of scientific documents while preserving their original layouts. The problem addressed is that language barriers in scientific documents hinder the spread and development of science and technology, and previous translation efforts have largely ignored the importance of document layouts. To solve this, the authors developed PDFMathTranslate, which uses large language models and precise layout detection to translate documents accurately. The method leverages recent advances in these areas to improve precision, flexibility, and efficiency. The key contribution of the paper is the development of this open-source software, which has been made available to the community and has already gained significant attention with over 222,000 downloads. The results show that PDFMathTranslate is effective in translating scientific documents while preserving their layouts, making it a valuable tool for the scientific community.

📅 Published on Jul 2, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2507.03009
• PDF: https://arxiv.org/pdf/2507.03009
• GitHub: https://github.com/byaidu/pdfmathtranslate ⭐ 33.6k

#ScientificDocumentTranslation #LanguageBarriersInScience #DocumentLayoutPreservation #MachineTranslationForScience #AcademicTextTranslation

PDFMathTranslate: Scientific Document Translation Preserving Layouts

Language barriers in scientific documents hinder the diffusion and development of science and technologies. However, prior efforts in translating such documents largely overlooked the information...


🔥 Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

💡 The paper introduces mmGRPO, a multi-module extension of Group Relative Policy Optimization (GRPO), to improve the accuracy of modular AI systems that combine multiple language model calls, each with its own prompt. The problem addressed is that GRPO was developed for post-training a single language model, and it is unclear how best to apply it to modular programs that chain several distinct calls. The authors propose mmGRPO, which groups language model calls by module and handles variable-length and interrupted trajectories. The method is composed with automatic prompt optimization to further improve accuracy. The results show that this combination improves accuracy by 11% on average across classification, many-hop search, and privacy-preserving delegation tasks compared to the post-trained language model, and by 5% compared to prompt optimization alone. The authors have open-sourced mmGRPO in DSPy as the dspy.GRPO optimizer, making it available for use in modular AI systems. Overall, the paper contributes a new method for optimizing language models in modular systems, which can lead to improved performance in a range of tasks.
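
Grouping language model calls by module can be sketched as a per-module version of the usual group-relative advantage, (reward - group mean) / group standard deviation. This is a simplified reading of the summary, not the mmGRPO implementation: the `Rollout` type, the function name, and the choice to give every call the final reward of its rollout are assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# One rollout: the modules it invoked, in order, plus one scalar reward.
Rollout = Tuple[List[str], float]


def module_level_advantages(
    rollouts: List[Rollout], eps: float = 1e-8
) -> Dict[str, List[Tuple[int, float]]]:
    """Return, per module, a list of (rollout index, normalized advantage)."""
    groups: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for idx, (modules, reward) in enumerate(rollouts):
        for module in modules:
            # Every call inherits the reward of the rollout it belongs to.
            groups[module].append((idx, reward))

    advantages: Dict[str, List[Tuple[int, float]]] = {}
    for module, calls in groups.items():
        rewards = [r for _, r in calls]
        mu, sigma = mean(rewards), pstdev(rewards)
        advantages[module] = [(idx, (r - mu) / (sigma + eps)) for idx, r in calls]
    return advantages


rollouts = [
    (["search", "answer"], 1.0),            # complete trajectory
    (["search", "search", "answer"], 0.0),  # longer trajectory
    (["search"], 0.0),                      # interrupted before answering
]
adv = module_level_advantages(rollouts)
for module, values in adv.items():
    print(module, values)
```

Because groups are formed per module, trajectories of different lengths need no padding, and an interrupted rollout simply contributes nothing to the modules it never reached.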

📅 Published on Aug 6, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2508.04660
• PDF: https://arxiv.org/pdf/2508.04660
• Project Page: https://dspy.ai
• GitHub: https://github.com/stanfordnlp/dspy ⭐ 34.2k

#MultiModuleLearning #LanguageModelOptimization #PolicyGradientMethods #ModularAISystems #PromptOptimizationTechniques

Composing Policy Gradients and Prompt Optimization for Language...

Group Relative Policy Optimization (GRPO) has proven to be an effective tool for post-training language models (LMs). However, AI systems are increasingly expressed as modular programs that mix...


🔥 Adam's Law: Textual Frequency Law on Large Language Models

💡 The paper proposes a novel framework to improve large language model performance through textual frequency analysis. The authors argue that textual frequency, which is the frequency of certain words or phrases in a language, is relevant to human cognition and can also be applied to large language models. However, this topic has been understudied in the context of large language models.

The proposed framework consists of three main components. First, the authors introduce the Textual Frequency Law, which states that more frequent textual expressions should be preferred for large language models, both for prompting and fine-tuning. Because the training data of many large language models is not public, the authors estimate sentence-level frequency from online resources. They also use an input paraphraser to rewrite the input into a more frequent textual expression.

The second component is Textual Frequency Distillation, which involves querying large language models to conduct story completion by extending sentences in the datasets. The resulting corpora are used to adjust the initial estimation of textual frequency.

The third component is Curriculum Textual Frequency Training, which fine-tunes large language models in increasing order of sentence-level frequency. This means that the models are first trained on the least frequent sentences and then gradually moved to more frequent ones.
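
Taking "increasing order of sentence-level frequency" literally, the curriculum amounts to sorting training sentences from lowest to highest estimated frequency. The sketch below is illustrative only: the mean word count is a stand-in for the paper's frequency estimate, and `curriculum_order`, `toy_frequency`, and the counts are hypothetical.

```python
from typing import Callable, Dict, List


def curriculum_order(sentences: List[str],
                     frequency: Callable[[str], float]) -> List[str]:
    """Order training sentences from lowest to highest estimated frequency."""
    return sorted(sentences, key=frequency)


def toy_frequency(sentence: str, word_counts: Dict[str, int]) -> float:
    """Stand-in estimate: mean corpus count of the words in the sentence."""
    words = sentence.lower().split()
    return sum(word_counts.get(w, 0) for w in words) / max(len(words), 1)


# Made-up counts, for illustration only.
counts = {"the": 100, "cat": 10, "sat": 5, "feline": 1, "reposed": 1}
data = ["the cat sat", "feline reposed", "the feline reposed"]
ordered = curriculum_order(data, lambda s: toy_frequency(s, counts))
print(ordered)
```

The same frequency function could also be used to choose between paraphrases of a prompt, which is how the first component's preference for frequent expressions would be applied at inference time.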

The authors conducted experiments on a curated dataset called Textual Frequency Paired Dataset, which covers tasks such as math reasoning, machine translation, commonsense reasoning, and agentic tool calling. The results show that the proposed framework is effective in improving large language model performance.

Overall, the paper contributes to the understanding of textual frequency in large language models and provides a novel framework for improving their performance. The proposed framework has the potential to be applied to various natural language processing tasks and can lead to more efficient and effective large language models.

📅 Published on Apr 2

🔗 Links:
• arXiv: https://arxiv.org/abs/2604.02176
• PDF: https://arxiv.org/pdf/2604.02176
• GitHub: https://github.com/HongyuanLuke/frequencylaw ⭐ 658

📊 Datasets citing this paper:
• https://huggingface.co/datasets/Akaashiiii/TFPD

#AdamsLaw #TextualFrequencyAnalysis #LargeLanguageModels #NaturalLanguageProcessing #LanguageModelOptimization

Adam's Law: Textual Frequency Law on Large Language Models

While textual frequency has been validated as relevant to human cognition in reading speed, its relatedness to Large Language Models (LLMs) is seldom studied. We propose a novel research direction...


Forwarded from Machine Learning with Python

Unlock Your AI Career
Join our Data Science Full Stack with AI Course – a real-time, project-based online training designed for hands-on mastery.
Core Topics Covered
• Data Science using Python with Generative AI: Build end-to-end data pipelines, from data wrangling to deploying AI models with Python libraries like Pandas, Scikit-learn, and Hugging Face transformers.
• Prompt Engineering: Craft precise prompts to maximize output from models like GPT and Gemini for accurate, creative results.
• AI Agents & Agentic AI: Develop autonomous agents that reason, plan, and act using frameworks like LangChain for real-world automation.
Why Choose This Course?
This training emphasizes live sessions, industry projects, and practical skills for immediate job impact, similar to top programs offering 100+ hours of Python-to-AI progression.
Ready to start? Call/WhatsApp: (+91)-7416877757
WhatsApp Link:-
http://wa.me/+917416877757


🔥 RLDX-1 Technical Report

💡 The paper introduces RLDX-1, a general-purpose robotic policy for dexterous manipulation that addresses the limitations of existing vision-language-action models. These models have shown progress in human-like generalist robotic policies but struggle with complex real-world tasks that require broader functional capabilities such as motion awareness, memory-aware decision making, and physical sensing. To overcome this, RLDX-1 uses a Multi-Stream Action Transformer architecture that integrates heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. This architecture is combined with system-level design choices including synthesizing training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. The results show that RLDX-1 outperforms recent frontier vision-language-action models across both simulation benchmarks and real-world tasks, achieving success rates of 86.8 percent in ALLEX humanoid tasks compared to around 40 percent for other models. This positions RLDX-1 as a promising step toward reliable vision-language-action models for complex and dynamic real-world dexterous manipulation. The method and results demonstrate the ability of RLDX-1 to control a high-degree-of-freedom humanoid robot under diverse functional demands, highlighting its potential for complex real-world tasks.

📅 Published on May 5

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.03269
• PDF: https://arxiv.org/pdf/2605.03269
• Project Page: http://rlwrld.ai/rldx-1
• GitHub: https://github.com/RLWRLD/RLDX-1 ⭐ 75

🤖 Models citing this paper:
• https://huggingface.co/RLWRLD/RLDX-1-PT
• https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA
• https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX

#RoboticManipulation #DexterousRobotics #VisionLanguageAction #MultiModalLearning #RobotPolicyLearning

RLDX-1 Technical Report

While Vision-Language-Action models (VLAs) have shown remarkable progress toward human-like generalist robotic policies through the versatile intelligence (i.e. broad scene understanding and...
