AI & ML Papers

🔥 MolmoAct2: Action Reasoning Models for Real-world Deployment

💡 The paper presents MolmoAct2, an open action reasoning model for robotics. Vision-language-action (VLA) models aim to provide a single generalist controller for robots, but current systems fall short for real-world deployment: for example, they are closed, require expensive hardware, or have high latency. MolmoAct2 addresses these issues with several new components. Its backbone, MolmoER, is a vision-language model trained on a large corpus for spatial and embodied reasoning. The release also includes three new datasets, among them the largest open bimanual dataset to date, and OpenFAST, an open-weight action tokenizer. The architecture is redesigned to add a continuous-action expert and an adaptive-depth reasoning variant, MolmoThink, which reduces latency by re-predicting depth tokens only for scene regions that change between timesteps. MolmoAct2 outperforms strong baselines on several simulation and real-world benchmarks, and the model weights, training code, and training data are all released. Overall, MolmoAct2 is a fully open action reasoning model built for practical deployment.
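The MolmoThink idea of refreshing depth tokens only where the scene changed can be sketched as a region-level cache. This is a minimal illustration under stated assumptions, not the paper's method: the coarse region grid, the mean-absolute-difference change test, the `threshold` value, and the `predict_depth` callable are all hypothetical stand-ins for whatever change detection and depth predictor MolmoAct2 actually uses.

```python
from typing import Callable, Dict, List, Optional, Tuple

Frame = List[List[float]]   # toy grayscale frame, frame[row][col]
Region = Tuple[int, int]    # (region_row, region_col) on a coarse grid (assumption)


def all_regions(frame: Frame, size: int) -> List[Region]:
    """Every cell of a size x size region grid covering the frame."""
    rows, cols = len(frame), len(frame[0])
    return [(r, c) for r in range((rows + size - 1) // size)
            for c in range((cols + size - 1) // size)]


def changed_regions(prev: Frame, curr: Frame, size: int, threshold: float) -> List[Region]:
    """Regions whose mean absolute pixel change exceeds `threshold` (hypothetical test)."""
    rows, cols = len(curr), len(curr[0])
    changed = []
    for r, c in all_regions(curr, size):
        diffs = [abs(curr[y][x] - prev[y][x])
                 for y in range(r * size, min((r + 1) * size, rows))
                 for x in range(c * size, min((c + 1) * size, cols))]
        if sum(diffs) / len(diffs) > threshold:
            changed.append((r, c))
    return changed


def update_depth_tokens(cache: Dict[Region, int], prev: Optional[Frame], curr: Frame,
                        size: int, threshold: float,
                        predict_depth: Callable[[Frame, Region], int]) -> int:
    """Refresh cached depth tokens in place and return how many regions were re-predicted.
    On the first timestep (prev is None) every region is predicted."""
    if prev is None:
        todo = all_regions(curr, size)
    else:
        todo = changed_regions(prev, curr, size, threshold)
    for region in todo:
        cache[region] = predict_depth(curr, region)  # hypothetical depth-token predictor
    return len(todo)


def fake_predictor(frame: Frame, region: Region) -> int:
    """Stand-in for the model: returns a fake depth token id for the region."""
    return region[0] * 10 + region[1]


# Toy usage: a 4x4 frame split into four 2x2 regions; only the top-left pixel changes.
f0 = [[0.0] * 4 for _ in range(4)]
f1 = [row[:] for row in f0]
f1[0][0] = 1.0
cache: Dict[Region, int] = {}
first = update_depth_tokens(cache, None, f0, 2, 0.1, fake_predictor)   # predicts all 4 regions
second = update_depth_tokens(cache, f0, f1, 2, 0.1, fake_predictor)    # re-predicts only (0, 0)
```

The saving comes from the second call: three of the four cached tokens are reused unchanged, which is the latency argument the summary attributes to MolmoThink.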

arXiv.org

MolmoAct2: Action Reasoning Models for Real-world Deployment

Vision-Language-Action (VLA) models aim to provide a single generalist controller for robots, but today's systems fall short on the criteria that matter for real-world deployment. Frontier models...

🔥 HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

💡 The paper introduces HeavySkill, a framework that internalizes complex reasoning as a skill within a model's parameters rather than relying on external orchestration frameworks that coordinate multiple agents with memory, skills, and tool use. The authors argue that such intricate system designs obscure the mechanism that actually drives performance. HeavySkill treats heavy thinking as an inner skill that can be learned and scaled via reinforcement learning, and implements it as a two-stage pipeline of parallel reasoning followed by summarization that can run beneath any agentic harness. In a systematic empirical study across diverse domains, HeavySkill consistently outperforms traditional Best-of-N strategies, and with stronger language models its performance can even approach Pass@N. Both the depth and the width of heavy thinking can be scaled further with reinforcement learning. This points toward self-evolving language models that internalize complex reasoning without brittle orchestration layers.
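The two-stage pipeline (parallel reasoning, then summarization) and the Best-of-N baseline it is compared against can be contrasted in a few lines. This is a minimal control-flow sketch only: `reason`, `summarize`, and `score` are hypothetical callables standing in for model calls, and the majority-vote summarizer used in the toy usage is an assumption, not HeavySkill's learned summarization step.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def heavy_think(question: str,
                reason: Callable[[str, int], str],
                summarize: Callable[[str, List[str]], str],
                width: int = 4) -> str:
    """Stage 1: sample `width` reasoning traces in parallel.
    Stage 2: one summarization pass reads all traces and writes the final answer."""
    with ThreadPoolExecutor(max_workers=width) as pool:
        traces = list(pool.map(lambda seed: reason(question, seed), range(width)))
    return summarize(question, traces)


def best_of_n(question: str,
              reason: Callable[[str, int], str],
              score: Callable[[str], float],
              n: int = 4) -> str:
    """Baseline: keep the single highest-scoring trace and discard the rest."""
    traces = [reason(question, seed) for seed in range(n)]
    return max(traces, key=score)


# Toy stand-ins (hypothetical): four fixed "reasoning traces" and a majority-vote summarizer.
ANSWERS = {0: "42", 1: "41", 2: "42", 3: "40"}


def toy_reason(question: str, seed: int) -> str:
    return ANSWERS[seed]


def toy_summarize(question: str, traces: List[str]) -> str:
    # A real summarizer would be another model pass that can combine evidence across traces.
    return max(set(traces), key=traces.count)
```

The difference the sketch highlights is that the summarizer sees every trace, while Best-of-N can only pick one of them according to an external score.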

📅 Published on May 4

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.02396
• PDF: https://arxiv.org/pdf/2605.02396
• Project Page: https://github.com/wjn1996/HeavySkill
• GitHub: https://github.com/wjn1996/HeavySkill ⭐ 40

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticHarness #HeavyThinking #ReinforcementLearning #ComplexReasoning #InnerSkillMechanisms

🔥 PyTorch Distributed: Experiences on Accelerating Data Parallel Training

💡 The paper discusses the design and implementation of the PyTorch distributed data parallel module, which aims to optimize large-scale model training by scaling out to multiple computational resources. The need for this arises from the increasing demand for large datasets and models in deep learning research and applications. Data parallelism is a popular solution for distributed training, where the model is replicated on each resource to generate gradients independently, and then these gradients are communicated at each iteration to keep the model replicas consistent.

However, optimizing the distributed training efficiency is non-trivial due to the subtle dependencies between computation and communication. To address this, the PyTorch distributed data parallel module provides several techniques to accelerate distributed training, including gradient bucketing, computation-communication overlap, and selective synchronization.
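Gradient bucketing and computation-communication overlap can be illustrated with a small simulation. This is a toy model, not PyTorch's implementation: sizes are in arbitrary units, `allreduce_mean` stands in for a collective all-reduce, and the event log is only illustrative. In real PyTorch, the bucket size is set by the `bucket_cap_mb` argument of `DistributedDataParallel` (25 MB by default), and the `no_sync()` context manager skips gradient synchronization for selected iterations.

```python
from typing import Dict, List


def make_buckets(param_sizes: Dict[str, int], cap: int) -> List[List[str]]:
    """Assign parameters to buckets in reverse registration order, because
    backward produces gradients for the last layers first. A bucket is closed
    once adding the next parameter would exceed `cap`."""
    buckets: List[List[str]] = []
    current: List[str] = []
    used = 0
    for name in reversed(list(param_sizes)):
        size = param_sizes[name]
        if current and used + size > cap:
            buckets.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        buckets.append(current)
    return buckets


def backward_with_overlap(param_sizes: Dict[str, int], cap: int) -> List[str]:
    """Event log of a simulated backward pass: a bucket's all-reduce is launched
    as soon as all of its gradients are ready, while backward keeps running."""
    buckets = make_buckets(param_sizes, cap)
    ready, launched, events = set(), set(), []
    for name in reversed(list(param_sizes)):
        events.append(f"grad {name}")
        ready.add(name)
        for i, bucket in enumerate(buckets):
            if i not in launched and all(p in ready for p in bucket):
                launched.add(i)
                events.append(f"allreduce bucket {i}")
    return events


def allreduce_mean(replica_grads: List[Dict[str, List[float]]],
                   bucket: List[str]) -> Dict[str, List[float]]:
    """Average one bucket's gradients across replicas, standing in for the
    collective all-reduce that keeps model replicas consistent."""
    world = len(replica_grads)
    return {name: [sum(vals) / world for vals in zip(*(g[name] for g in replica_grads))]
            for name in bucket}
```

With sizes `{"embed": 40, "layer1": 30, "layer2": 30, "head": 20}` and a cap of 50, the first bucket's all-reduce starts right after the `layer2` gradient is ready, while the gradients for `layer1` and `embed` are still being computed. That overlap is what hides communication cost behind computation.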

The paper evaluates these techniques and shows that, when configured appropriately, the module achieves near-linear scalability using up to 256 GPUs: training throughput grows almost in proportion to the number of GPUs, so large models train much faster.

Overall, the paper contributes practical techniques for efficient distributed training, and its evaluation shows that the PyTorch distributed data parallel module can deliver substantial speedups when training large models.

📅 Published on Jun 28, 2020

🔗 Links:
• arXiv: https://arxiv.org/abs/2006.15704
• PDF: https://arxiv.org/pdf/2006.15704
• GitHub: https://github.com/pytorch/pytorch ⭐ 99.7k

━━━━━━━━━━━━━━━━━━━━━━━━

#PyTorchDistributed #DataParallelTraining #DistributedDeepLearning #LargeScaleModelTraining #AcceleratedMachineLearning

🔥 Continuous Audio Language Models

💡 The paper introduces Continuous Audio Language Models (CALM), an approach to audio generation that addresses a limitation of discrete audio language models, which have become the dominant paradigm for speech and music generation. Discrete models represent audio as sequences of tokens extracted from lossy codecs with limited bitrate, which forces a trade-off between audio quality and computational cost. CALM instead uses a large Transformer backbone that produces a contextual embedding at every timestep. This embedding conditions a multilayer perceptron that generates the next continuous frame of an audio variational autoencoder (VAE) through consistency modeling. By avoiding lossy discrete compression, CALM achieves higher quality at lower computational cost than its discrete counterparts. Experiments on speech and music show improved efficiency and fidelity over state-of-the-art discrete audio language models, enabling lightweight, high-quality audio generation. Audio samples are available on the project page.
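The data flow described above (backbone context, then a one-step head from noise to the next continuous frame) can be sketched as a generation loop. This is only a structural sketch under loud assumptions: `toy_backbone` (a running mean) and `toy_head` (context plus scaled noise) are hypothetical stand-ins for the trained Transformer and consistency MLP. The point is that no codebook lookup or discrete token appears anywhere in the loop.

```python
import random
from typing import Callable, List, Sequence

Frame = List[float]  # one continuous VAE latent frame (toy dimensionality)


def generate(prompt: Sequence[Frame], n_frames: int,
             backbone: Callable[[Sequence[Frame]], List[float]],
             head: Callable[[List[float], List[float]], Frame],
             dim: int, seed: int = 0) -> List[Frame]:
    """Autoregressive loop: the backbone summarizes all previous frames into a
    context embedding, and the head maps (Gaussian noise, context) to the next
    continuous frame in a single step, as a consistency model would."""
    rng = random.Random(seed)
    frames = list(prompt)
    for _ in range(n_frames):
        context = backbone(frames)
        noise = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        frames.append(head(noise, context))
    return frames[len(prompt):]


def toy_backbone(frames: Sequence[Frame]) -> List[float]:
    """Hypothetical stand-in for the Transformer: the mean of all previous frames."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]


def toy_head(noise: List[float], context: List[float]) -> Frame:
    """Hypothetical stand-in for the consistency MLP: one step from noise to a frame."""
    return [c + 0.1 * n for c, n in zip(context, noise)]


frames = generate([[0.0, 1.0]], 5, toy_backbone, toy_head, dim=2, seed=0)
```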

📅 Published on Sep 8, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2509.06926
• PDF: https://arxiv.org/pdf/2509.06926
• Project Page: https://huggingface.co/spaces/kyutai/calm-samples
• GitHub: https://github.com/kyutai-labs/pocket-tts ⭐ 4.2k

🤖 Models citing this paper:
• https://huggingface.co/kyutai/pocket-tts
• https://huggingface.co/kyutai/pocket-tts-without-voice-cloning
• https://huggingface.co/Verylicious/pocket-tts-ungated

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/D3vShoaib/pocket-tts
• https://huggingface.co/spaces/kyutai/calm-samples
• https://huggingface.co/spaces/Xlnk/tts

━━━━━━━━━━━━━━━━━━━━━━━━

#AudioLanguageModels #ContinuousAudioGeneration #TransformerBackbone #AudioVariationalAutoencoders #MultilayerPerceptron

🔥 PDFMathTranslate: Scientific Document Translation Preserving Layouts

💡 The paper introduces PDFMathTranslate, open-source software that translates scientific documents while preserving their original layouts. Language barriers in scientific documents hinder the diffusion and development of science and technology, and prior translation efforts have largely overlooked layout information. PDFMathTranslate combines large language models with precise layout detection, building on recent advances in both areas to improve precision, flexibility, and efficiency. The software has been released to the community and has already been downloaded more than 222,000 times. The results show that it translates scientific documents effectively while keeping their layouts intact, making it a valuable tool for the scientific community.
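One common way to keep formulas and layout intact during translation is to translate each detected text block separately, mask formulas with placeholders before translation, and write the result back into the same bounding box. The sketch below illustrates that general idea under explicit assumptions. The `TextBlock` type, the `{v0}` placeholder scheme, and the `$...$` regex are hypothetical. Real PDFs do not mark math with dollar signs, so PDFMathTranslate has to detect formulas from layout and font information instead.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Tuple

# Assumption: inline math is delimited by $...$ (stands in for real formula detection).
MATH = re.compile(r"\$[^$]+\$")


@dataclass
class TextBlock:
    bbox: Tuple[float, float, float, float]  # x0, y0, x1, y1 from layout detection
    text: str


def protect_math(text: str) -> Tuple[str, List[str]]:
    """Replace each formula with a numbered placeholder so the translator leaves it alone."""
    formulas: List[str] = []

    def repl(match: "re.Match[str]") -> str:
        formulas.append(match.group(0))
        return "{v%d}" % (len(formulas) - 1)

    return MATH.sub(repl, text), formulas


def restore_math(text: str, formulas: List[str]) -> str:
    """Put the original formulas back in place of their placeholders."""
    for i, formula in enumerate(formulas):
        text = text.replace("{v%d}" % i, formula)
    return text


def translate_page(blocks: List[TextBlock], translate: Callable[[str], str]) -> List[TextBlock]:
    """Translate block by block and keep every block's original bounding box."""
    out = []
    for block in blocks:
        masked, formulas = protect_math(block.text)
        out.append(TextBlock(block.bbox, restore_math(translate(masked), formulas)))
    return out
```

In practice `translate` would wrap an LLM call; the design choice worth noting is that the model only ever sees placeholder tokens, never the formulas themselves.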

📅 Published on Jul 2, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2507.03009
• PDF: https://arxiv.org/pdf/2507.03009
• GitHub: https://github.com/byaidu/pdfmathtranslate ⭐ 33.6k

━━━━━━━━━━━━━━━━━━━━━━━━

#ScientificDocumentTranslation #LanguageBarriersInScience #DocumentLayoutPreservation #MachineTranslationForScience #AcademicTextTranslation

🔥 Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

💡 The paper introduces mmGRPO, a multi-module extension of Group Relative Policy Optimization (GRPO), to improve the accuracy of modular AI systems that combine multiple language model calls and prompts. GRPO has proven effective for post-training language models, but it is not designed for modular programs in which several distinct LM calls work together. mmGRPO groups LM calls by module and handles variable-length and interrupted trajectories, and it is composed with automatic prompt optimization to improve accuracy further. Across classification, many-hop search, and privacy-preserving delegation tasks, mmGRPO improves accuracy by 11% on average over the post-trained LM and by 5% over prompt optimization alone. The authors open-source mmGRPO in DSPy as the dspy.GRPO optimizer, making it available for modular AI systems.
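Grouping LM calls by module while tolerating variable-length and interrupted rollouts can be sketched as a per-module version of GRPO's group-relative advantage, (r - mean) / std. This is a simplified sketch, not the paper's exact grouping rule. It assumes every call inherits its rollout's final reward and that a module's calls are normalized only against other calls of the same module for the same input. The `Rollout` shape and module names are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev
from typing import Dict, List, Tuple

# One rollout of a modular program for a shared input: the modules it invoked
# (in order, possibly repeating or stopping early) and the program's final reward.
Rollout = Tuple[List[str], float]


def module_advantages(group: List[Rollout],
                      eps: float = 1e-6) -> Dict[str, List[Tuple[int, float]]]:
    """Return module -> list of (rollout index, group-relative advantage).

    Every LM call inherits its rollout's final reward. Calls are grouped by module,
    and each group is normalized separately. An interrupted rollout simply
    contributes no calls to the modules it never reached."""
    by_module: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for idx, (modules, reward) in enumerate(group):
        for name in modules:
            by_module[name].append((idx, reward))
    advantages: Dict[str, List[Tuple[int, float]]] = {}
    for name, calls in by_module.items():
        rewards = [r for _, r in calls]
        mu, sigma = mean(rewards), pstdev(rewards)
        advantages[name] = [(idx, (r - mu) / (sigma + eps)) for idx, r in calls]
    return advantages


# Toy group for one input: rollout 1 calls "query" twice, rollout 2 is interrupted.
group = [(["query", "answer"], 1.0),
         (["query", "query", "answer"], 0.0),
         (["query"], 0.0)]
adv = module_advantages(group)
```

The per-module advantages would then weight each module's own policy-gradient update. The toy group shows that an interrupted rollout, which never reaches `answer`, still contributes to the `query` group.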

📅 Published on Aug 6, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2508.04660
• PDF: https://arxiv.org/pdf/2508.04660
• Project Page: https://dspy.ai
• GitHub: https://github.com/stanfordnlp/dspy ⭐ 34.2k

━━━━━━━━━━━━━━━━━━━━━━━━

#MultiModuleLearning #LanguageModelOptimization #PolicyGradientMethods #ModularAISystems #PromptOptimizationTechniques

🔥 Adam's Law: Textual Frequency Law on Large Language Models

💡 The paper proposes a framework for improving large language model (LLM) performance through textual frequency. Textual frequency, how often words or phrases occur in a language, has been shown to be relevant to human cognition (for example, reading speed), but its relation to LLMs has seldom been studied.

The framework has three main components. The first is the Textual Frequency Law, which states that frequent textual data should be preferred for LLMs, both for prompting and for fine-tuning. Because the training data of many LLMs is not public, the authors estimate sentence-level frequency from online resources. They also use an input paraphraser to rewrite inputs into more frequent textual expressions.
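Sentence-level frequency estimation and frequency-based paraphrase selection can be sketched with a word-count table. This is only an illustration under assumptions. The paper estimates frequency from online resources, so the `counts` table here is hypothetical. Scoring by mean log relative frequency with add-one smoothing is my choice, not the authors'. A fixed list of candidates replaces the LLM paraphraser.

```python
import math
from typing import Dict, List


def sentence_frequency(sentence: str, counts: Dict[str, int], total: int) -> float:
    """Mean log relative frequency of the sentence's words (higher = more frequent).
    Add-one smoothing keeps unseen words from producing log(0)."""
    words = sentence.lower().split()
    if not words:
        raise ValueError("empty sentence")
    vocab = len(counts) + 1  # +1 slot for unseen words
    logs = [math.log((counts.get(w, 0) + 1) / (total + vocab)) for w in words]
    return sum(logs) / len(logs)


def most_frequent_paraphrase(paraphrases: List[str], counts: Dict[str, int], total: int) -> str:
    """Pick the candidate rewrite with the highest estimated frequency."""
    return max(paraphrases, key=lambda s: sentence_frequency(s, counts, total))


# Hypothetical word counts standing in for frequency statistics gathered online.
counts = {"the": 100, "big": 20, "large": 5, "dog": 10}
total = sum(counts.values())
choice = most_frequent_paraphrase(["the large dog", "the big dog"], counts, total)
```

Averaging instead of summing the log frequencies keeps longer sentences from being penalized simply for having more words.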

The second component, Textual Frequency Distillation, queries LLMs to perform story completion by extending sentences in the datasets, and uses the resulting corpora to refine the initial frequency estimates.

The third component, Curriculum Textual Frequency Training, fine-tunes LLMs in increasing order of sentence-level frequency: training starts with the least frequent sentences and gradually moves to more frequent ones.
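A curriculum ordered by increasing sentence-level frequency reduces to sorting by a frequency score and batching in that order. This is a minimal sketch: `freq` is a hypothetical scoring function (any per-sentence frequency estimate would do), and fixed-size sequential batches are an assumption about how the ordered data is consumed.

```python
from typing import Callable, List, Sequence


def curriculum_batches(examples: Sequence[str],
                       freq: Callable[[str], float],
                       batch_size: int) -> List[List[str]]:
    """Sort examples by increasing estimated frequency and chunk them into
    batches that are fed to fine-tuning in this order."""
    ordered = sorted(examples, key=freq)
    return [list(ordered[i:i + batch_size]) for i in range(0, len(ordered), batch_size)]


# Hypothetical frequency scores for four training sentences.
scores = {"a": 3.0, "b": 1.0, "c": 2.0, "d": 0.5}
batches = curriculum_batches(list(scores), lambda s: scores[s], batch_size=2)
```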

The experiments use a curated Textual Frequency Paired Dataset (TFPD) covering math reasoning, machine translation, commonsense reasoning, and agentic tool calling, and show that the framework improves LLM performance.

Overall, the paper opens a research direction on textual frequency in LLMs and provides a framework that could apply to a range of natural language processing tasks.

📅 Published on Apr 2

🔗 Links:
• arXiv: https://arxiv.org/abs/2604.02176
• PDF: https://arxiv.org/pdf/2604.02176
• GitHub: https://github.com/HongyuanLuke/frequencylaw ⭐ 658

📊 Datasets citing this paper:
• https://huggingface.co/datasets/Akaashiiii/TFPD

━━━━━━━━━━━━━━━━━━━━━━━━

#AdamsLaw #TextualFrequencyAnalysis #LargeLanguageModels #NaturalLanguageProcessing #LanguageModelOptimization

🔥 RLDX-1 Technical Report

💡 The paper introduces RLDX-1, a general-purpose robotic policy for dexterous manipulation. Vision-language-action models have made progress toward human-like generalist robotic policies, but they struggle with complex real-world tasks that require broader functional capabilities such as motion awareness, memory-aware decision making, and physical sensing. RLDX-1 addresses this with a Multi-Stream Action Transformer that integrates heterogeneous modalities through modality-specific streams joined by cross-modal joint self-attention. The architecture is paired with system-level design choices: synthesized training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. RLDX-1 outperforms recent frontier vision-language-action models on both simulation benchmarks and real-world tasks, reaching an 86.8 percent success rate on ALLEX humanoid tasks compared with around 40 percent for other models. These results show that RLDX-1 can control a high-degree-of-freedom humanoid robot under diverse functional demands, a promising step toward reliable vision-language-action models for complex, dynamic real-world dexterous manipulation.
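Modality-specific streams combined through joint self-attention can be illustrated with a single attention layer over the concatenated tokens of all streams. This is a heavily simplified sketch, not RLDX-1's architecture. It uses one head, shares the same vectors as queries, keys, and values, and assigns each stream only its own input projection. The stream names and identity projections are hypothetical.

```python
import math
from typing import Dict, List

Vec = List[float]


def matvec(w: List[Vec], x: Vec) -> Vec:
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def softmax(xs: Vec) -> Vec:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def joint_attention(streams: Dict[str, List[Vec]], proj: Dict[str, List[Vec]]) -> Dict[str, List[Vec]]:
    """Each modality keeps its own projection (its "stream"), but attention runs once
    over the concatenation of all streams' tokens, so every token can attend across
    modalities. Outputs are split back per stream. Single head; queries, keys, and
    values are the same projected vectors (simplifying assumption)."""
    tokens: List[Vec] = []
    owner: List[str] = []
    for name, toks in streams.items():
        for t in toks:
            tokens.append(matvec(proj[name], t))
            owner.append(name)
    d = len(tokens[0])
    out: Dict[str, List[Vec]] = {name: [] for name in streams}
    for q, name in zip(tokens, owner):
        weights = softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens])
        mixed = [sum(w * v[j] for w, v in zip(weights, tokens)) for j in range(d)]
        out[name].append(mixed)
    return out


# Hypothetical streams: two vision tokens, one proprioception token, one action token.
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
streams = {"vision": [[1.0, 0.0], [0.0, 1.0]], "proprio": [[0.5, 0.5]], "action": [[0.0, 0.0]]}
out = joint_attention(streams, {name: IDENTITY for name in streams})
```

In the toy usage the all-zero action token gives equal scores to every token, so its output is the mean of all four tokens across modalities.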

📅 Published on May 5

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.03269
• PDF: https://arxiv.org/pdf/2605.03269
• Project Page: http://rlwrld.ai/rldx-1
• GitHub: https://github.com/RLWRLD/RLDX-1 ⭐ 75

🤖 Models citing this paper:
• https://huggingface.co/RLWRLD/RLDX-1-PT
• https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA
• https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX

━━━━━━━━━━━━━━━━━━━━━━━━

#RoboticManipulation #DexterousRobotics #VisionLanguageAction #MultiModalLearning #RobotPolicyLearning

🔥 PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

💡 The paper introduces PhysForge, a system for generating interactive 3D assets that combines a vision-language model with a physics-grounded diffusion model. Existing 3D asset generation methods focus on static geometry and overlook the functional properties that interactive virtual worlds and embodied AI require. PhysForge uses a two-stage framework. First, a vision-language model plans a hierarchical physical blueprint that defines material, functional, and kinematic constraints. Then, a physics-grounded diffusion model synthesizes high-fidelity geometry and precise kinematic parameters using a novel injection mechanism called KineVoxel Injection. The system is supported by PhysDB, a large-scale dataset of 150,000 assets with physical annotations. The results show that PhysForge produces functionally plausible, simulation-ready assets, providing a robust data engine for interactive 3D content and embodied agents.
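A hierarchical physical blueprint of the kind the first stage produces (parts with materials, functions, and kinematic joints) can be represented as a small data structure with consistency checks. This is a sketch with a hypothetical schema, not PhysForge's blueprint format. The field names, the two-joint vocabulary, and the validation rules are assumptions chosen only to show what "kinematic constraints" might mean in practice.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Joint:
    kind: str                           # "revolute" or "prismatic" (assumed vocabulary)
    parent: str                         # name of the part this one is attached to
    axis: Tuple[float, float, float]
    limits: Tuple[float, float]         # radians for revolute, metres for prismatic


@dataclass
class Part:
    name: str
    material: str
    function: str
    joint: Optional[Joint] = None       # None marks the fixed root part


@dataclass
class Blueprint:
    asset: str
    parts: List[Part] = field(default_factory=list)

    def validate(self) -> List[str]:
        """Return constraint violations; an empty list passes this toy check."""
        errors: List[str] = []
        names = {p.name for p in self.parts}
        roots = [p for p in self.parts if p.joint is None]
        if len(roots) != 1:
            errors.append(f"expected exactly one root part, found {len(roots)}")
        for p in self.parts:
            j = p.joint
            if j is None:
                continue
            if j.kind not in ("revolute", "prismatic"):
                errors.append(f"{p.name}: unknown joint type {j.kind!r}")
            if j.parent not in names:
                errors.append(f"{p.name}: parent {j.parent!r} does not exist")
            if j.limits[0] > j.limits[1]:
                errors.append(f"{p.name}: lower limit exceeds upper limit")
            if all(a == 0 for a in j.axis):
                errors.append(f"{p.name}: joint axis is zero")
        return errors


cabinet = Blueprint("cabinet", [
    Part("body", "wood", "storage"),
    Part("door", "wood", "open/close", Joint("revolute", "body", (0.0, 0.0, 1.0), (0.0, 1.57))),
])
```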

📅 Published on May 6

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.05163
• PDF: https://arxiv.org/pdf/2605.05163
• Project Page: https://hku-mmlab.github.io/PhysForge/
• GitHub: https://github.com/HKU-MMLab/PhysForge ⭐ 41

━━━━━━━━━━━━━━━━━━━━━━━━

#PhysicsGroundedModeling #InteractiveVirtualWorlds #3DAssetGeneration #EmbodiedAI #PhysicsBasedRendering

🔥 D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

💡 The paper introduces D-OPSD, a training approach for diffusion models that enables efficient supervised fine-tuning while preserving few-step inference. High-performance image generation is shifting from inefficient multi-step models to efficient few-step ones (e.g., Z-Image-Turbo and FLUX.2-klein), but these step-distilled models are hard to fine-tune: traditional fine-tuning methods compromise their inherent few-step inference capability.

To address this, the authors propose D-OPSD, which applies on-policy self-distillation using text and multimodal features. The model acts as both teacher and student: the student is conditioned only on the text feature, while the teacher is conditioned on the multimodal feature of both the text prompt and the target image. Training minimizes the difference between the teacher's and the student's predicted distributions over the student's own rollouts, so the model learns new concepts and styles without sacrificing its original few-step capability.
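The on-policy part of the objective is that teacher and student are compared on states that the student itself visits during its few-step rollout. The sketch below shows that loop under explicit assumptions. The `model(x, t, prompt, with_image)` signature is hypothetical, the model is assumed to output the next latent directly, and squared error stands in for the paper's distribution-matching loss. In a real training setup, the teacher output would be detached from the gradient.

```python
from typing import Callable, List

Latent = List[float]
# Hypothetical signature: model(x, t, prompt, with_image) -> next latent.
# with_image=True means the same model is also conditioned on the target image.
Model = Callable[[Latent, float, str, bool], Latent]


def self_distill_loss(model: Model, x_start: Latent, timesteps: List[float], prompt: str) -> float:
    """Average teacher/student gap along the student's own few-step rollout."""
    x, total = x_start, 0.0
    for t in timesteps:
        student = model(x, t, prompt, False)  # text-only conditioning
        teacher = model(x, t, prompt, True)   # text + target image; treated as a fixed target
        total += sum((s - q) ** 2 for s, q in zip(student, teacher)) / len(x)
        x = student  # on-policy: the next state comes from the student's own output
    return total / len(timesteps)


def toy_model(x: Latent, t: float, prompt: str, with_image: bool) -> Latent:
    """Stand-in model whose prediction shifts slightly when the image is available."""
    return [v * (0.5 if with_image else 0.4) for v in x]


loss = self_distill_loss(toy_model, [1.0, 2.0], [1.0, 0.5], "a cat")
```

Because the states come from the student's rollout rather than from noised training images, the loss targets exactly the trajectories the few-step sampler will follow at inference time.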

The key contribution of D-OPSD is on-policy learning during supervised fine-tuning: the model learns from its own trajectories under its own supervision. This lets the model inherit the in-context capabilities of its encoder, so it can be fine-tuned continuously without losing its few-step inference capability. The results show that D-OPSD enables efficient supervised fine-tuning of step-distilled diffusion models, making it a promising approach for high-performance image generation.

📅 Published on May 6

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.05204
• PDF: https://arxiv.org/pdf/2605.05204
• Project Page: https://vvvvvjdy.github.io/d-opsd/
• GitHub: https://github.com/vvvvvjdy/D-OPSD ⭐ 24

━━━━━━━━━━━━━━━━━━━━━━━━

#DiffusionModels #SelfDistillation #FewStepInference #ImageGeneration #MultimodalLearning
