AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
🔥 UniVidX: A Unified Multimodal Framework for Versatile Video Generation via Diffusion Priors

💡 The paper introduces UniVidX, a unified multimodal framework for versatile video generation using video diffusion model priors. The problem with existing methods is that they train separate models for each task, limiting the modeling of correlations across different modalities. UniVidX addresses this issue by formulating pixel-aligned tasks as conditional generation in a shared multimodal space, allowing it to adapt to modality-specific distributions while preserving the native priors of the video diffusion model.

The framework consists of three key designs: Stochastic Condition Masking, Decoupled Gated LoRA, and Cross-Modal Self-Attention. Stochastic Condition Masking enables omni-directional conditional generation by randomly partitioning modalities into clean conditions and noisy targets during training. Decoupled Gated LoRA preserves the strong priors of the video diffusion model by introducing per-modality LoRAs that are activated when a modality serves as the generation target. Cross-Modal Self-Attention facilitates information exchange and inter-modal alignment by sharing keys and values across modalities while keeping modality-specific queries.
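
As a rough illustration of Stochastic Condition Masking, the sketch below randomly splits a set of modalities into clean conditions and noisy targets at each training step. The modality names, the helper `partition_modalities`, and the choice to allow an empty condition set are assumptions made for this example, not details taken from the paper.

```python
import random

def partition_modalities(modalities, rng):
    """Randomly split modalities into clean conditions and noisy targets.

    At least one modality is always a target, so there is something to
    generate. Allowing an empty condition set is an assumption of this sketch.
    """
    shuffled = list(modalities)
    rng.shuffle(shuffled)
    num_targets = rng.randint(1, len(shuffled))  # inclusive on both ends
    targets = sorted(shuffled[:num_targets])
    conditions = sorted(shuffled[num_targets:])
    return conditions, targets

rng = random.Random(0)
modalities = ["rgb", "albedo", "normal", "irradiance"]  # hypothetical names
for step in range(3):
    conditions, targets = partition_modalities(modalities, rng)
    print(f"step {step}: conditions={conditions} targets={targets}")
```

Because every split is sampled afresh, one model sees each modality both as a condition and as a target over the course of training, which is what makes generation in any direction possible at inference time.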

The authors instantiate UniVidX in two domains: UniVid-Intrinsic for RGB videos and intrinsic maps, and UniVid-Alpha for blended RGB videos and their constituent RGBA layers. The results show that both models achieve performance competitive with state-of-the-art methods across distinct tasks and generalize robustly to in-the-wild scenarios, even when trained on fewer than 1000 videos. Overall, UniVidX provides a unified framework for versatile video generation, allowing for more efficient and effective modeling of correlations across different modalities.


📅 Published on May 1

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.00658
• PDF: https://arxiv.org/pdf/2605.00658
• Project Page: https://houyuanchen111.github.io/UniVidX.github.io/
• GitHub: https://github.com/houyuanchen111/UniVidX (93 stars)

🤖 Models citing this paper:
https://huggingface.co/houyuanchen/UniVidX

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalVideoGeneration #VideoDiffusionModels #ConditionalGeneration #CrossModalLearning #MultimodalFusionArchitectures
🔥 MolmoAct2: Action Reasoning Models for Real-world Deployment

💡 The paper presents MolmoAct2, an open action reasoning model for robotics that improves on previous systems in several ways. Current vision-language-action models aim to provide a single generalist controller for robots, but they have limitations: many are closed, require expensive hardware, or have high latency. MolmoAct2 addresses these issues with several new components, including a specialized vision-language-model backbone called MolmoER, which is trained on a large corpus of data and designed for spatial and embodied reasoning. The release also includes three new datasets, among them the largest open bimanual dataset to date, and an open-weight action tokenizer called OpenFAST. The architecture has been redesigned to include a continuous-action expert and an adaptive-depth reasoning variant called MolmoThink, which reduces latency by re-predicting depth tokens only for scene regions that change between timesteps. MolmoAct2 outperforms strong baselines on several simulation and real-world benchmarks, and the model weights, training code, and training data are released for use by others. Overall, MolmoAct2 is a fully open action reasoning model that is designed for practical deployment and advances the state of the art in robotics.
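
The latency saving from re-predicting depth tokens only where the scene changes can be sketched as a simple cache update. Everything here is a stand-in: regions are reduced to single numbers, `toy_depth_predictor` replaces the model call, and the change threshold is an arbitrary assumption.

```python
def update_depth_tokens(prev_frame, frame, cached_tokens, predict_depth, threshold=0.1):
    """Re-predict depth tokens only for regions that changed since the last frame.

    prev_frame / frame: lists of per-region summary values (hypothetical
    stand-ins for image patches). cached_tokens: tokens from the previous step.
    Returns the new token list and the indices that were recomputed.
    """
    tokens = list(cached_tokens)
    recomputed = []
    for i, (old, new) in enumerate(zip(prev_frame, frame)):
        if abs(new - old) > threshold:
            tokens[i] = predict_depth(new)
            recomputed.append(i)
    return tokens, recomputed

def toy_depth_predictor(region_value):
    # Placeholder for the model call; maps a region summary to a fake token id.
    return int(region_value * 10)

prev_frame = [0.2, 0.5, 0.9, 0.4]
frame = [0.2, 0.8, 0.9, 0.4]  # only region 1 changed
cached = [toy_depth_predictor(v) for v in prev_frame]
tokens, recomputed = update_depth_tokens(prev_frame, frame, cached, toy_depth_predictor)
print(recomputed, tokens)
```

In a mostly static scene the predictor runs for a small fraction of regions per timestep, which is where the reduced latency would come from.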


📅 Published on May 4

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.02881
• PDF: https://arxiv.org/pdf/2605.02881
• Project Page: https://allenai.org/blog/molmoact2
• GitHub: https://github.com/allenai/molmoact2 (90 stars)

🤖 Models citing this paper:
https://huggingface.co/allenai/MolmoAct2
https://huggingface.co/allenai/MolmoAct2-SO100_101
https://huggingface.co/allenai/Molmo2-ER

📊 Datasets citing this paper:
https://huggingface.co/datasets/allenai/13122025-tool-04
https://huggingface.co/datasets/allenai/13122025-cut-10
https://huggingface.co/datasets/allenai/28112025-yam-01

🚀 Spaces citing this paper:
https://huggingface.co/spaces/allenai/dataset-stats
https://huggingface.co/spaces/allenai/lerobot-visualizer-v3

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#RoboticsActionReasoning #VisionLanguageModels #EmbodiedAI #BimanualRobotics #SpatialReasoning
🔥 HeavySkill: Heavy Thinking as the Inner Skill in Agentic Harness

💡 The paper introduces HeavySkill, a framework that internalizes complex reasoning as a skill within a model's parameters, rather than relying on external orchestration. The problem with current approaches is that they use intricate system designs that obscure the underlying mechanism driving performance. HeavySkill proposes a two-stage pipeline consisting of parallel reasoning and summarization, which can operate beneath any agentic harness. The method involves identifying heavy thinking as an inner skill that can be learned and scaled via reinforcement learning. The authors conducted a systematic empirical study of HeavySkill across diverse domains and found that it consistently outperforms traditional Best-of-N strategies. The results show that stronger language models can even approach Pass@N performance, and that the depth and width of heavy thinking can be further scaled via reinforcement learning. This offers a promising path toward self-evolving language models that internalize complex reasoning without relying on brittle orchestration layers. Overall, the paper contributes a new perspective on complex reasoning, demonstrating that internalizing heavy thinking as a skill can lead to superior performance and more robust models.
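
The two-stage pipeline can be sketched as follows, assuming hypothetical `reason` and `summarize` callables in place of real model calls. The majority-vote summarizer is only a placeholder so the example runs; in the paper's framing the summarizer is a model that reads every trace.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def heavy_think(question, reason, summarize, width=4):
    """Stage 1: sample `width` reasoning traces in parallel.
    Stage 2: pass all traces to a summarizer that produces one final answer."""
    with ThreadPoolExecutor(max_workers=width) as pool:
        traces = list(pool.map(lambda seed: reason(question, seed), range(width)))
    return summarize(question, traces)

# Hypothetical stand-ins for model calls.
def toy_reason(question, seed):
    # Pretend three of four samples reach the right answer.
    return "41" if seed == 2 else "42"

def toy_summarize(question, traces):
    # A real summarizer would be a model reading every trace; majority vote
    # is only a placeholder so the sketch runs.
    return Counter(traces).most_common(1)[0][0]

answer = heavy_think("What is 6 * 7?", toy_reason, toy_summarize)
print(answer)
```

Unlike Best-of-N, which selects one existing candidate, the summarization stage sees all traces at once and may combine partial results from several of them.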


📅 Published on May 4

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.02396
• PDF: https://arxiv.org/pdf/2605.02396
• Project Page: https://github.com/wjn1996/HeavySkill
• GitHub: https://github.com/wjn1996/HeavySkill (40 stars)

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticHarness #HeavyThinking #ReinforcementLearning #ComplexReasoning #InnerSkillMechanisms
🔥 PyTorch Distributed: Experiences on Accelerating Data Parallel Training

💡 The paper discusses the design and implementation of the PyTorch distributed data parallel module, which aims to optimize large-scale model training by scaling out to multiple computational resources. The need for this arises from the increasing demand for large datasets and models in deep learning research and applications. Data parallelism is a popular solution for distributed training, where the model is replicated on each resource to generate gradients independently, and then these gradients are communicated at each iteration to keep the model replicas consistent.

However, optimizing the distributed training efficiency is non-trivial due to the subtle dependencies between computation and communication. To address this, the PyTorch distributed data parallel module provides several techniques to accelerate distributed training, including gradient bucketing, computation-communication overlap, and selective synchronization.
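
Gradient bucketing can be illustrated without any GPU code. The sketch below groups per-layer gradients into size-capped buckets, visiting layers in reverse order because the last layers finish their backward pass first. The layer names and sizes are invented for the example. In PyTorch itself the bucket size is exposed as the `bucket_cap_mb` argument of `DistributedDataParallel`, and the `no_sync()` context manager skips gradient synchronization for selected iterations.

```python
def build_buckets(param_sizes_mb, bucket_cap_mb=25.0):
    """Group parameter gradients into buckets of at most `bucket_cap_mb`.

    Parameters are visited in reverse registration order, since gradients for
    the last layers become ready first during the backward pass. Each full
    bucket can be all-reduced while earlier layers are still computing.
    A single parameter larger than the cap gets a bucket of its own.
    """
    buckets, current, current_size = [], [], 0.0
    for name, size in reversed(param_sizes_mb):
        if current and current_size + size > bucket_cap_mb:
            buckets.append(current)
            current, current_size = [], 0.0
        current.append(name)
        current_size += size
    if current:
        buckets.append(current)
    return buckets

# Hypothetical layers with gradient sizes in megabytes.
layers = [("embed", 20.0), ("block1", 12.0), ("block2", 12.0), ("head", 4.0)]
buckets = build_buckets(layers, bucket_cap_mb=25.0)
print(buckets)
```

Fewer, larger buckets reduce per-call communication overhead, while smaller buckets let communication start earlier and overlap more with computation, so the cap is a tuning knob, not a fixed constant.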

The paper evaluates these techniques and shows that, when configured appropriately, the PyTorch distributed data parallel module achieves near-linear scalability using up to 256 GPUs. In other words, training throughput grows almost in proportion to the number of GPUs, so training time falls almost proportionally as resources are added.

Overall, the paper contributes to the development of efficient distributed training methods, which is essential for the advancement of deep learning research and applications. The PyTorch distributed data parallel module provides a scalable and efficient solution for training large models, and its evaluation demonstrates the potential for significant speedups in training times.


📅 Published on Jun 28, 2020

🔗 Links:
• arXiv: https://arxiv.org/abs/2006.15704
• PDF: https://arxiv.org/pdf/2006.15704
• GitHub: https://github.com/pytorch/pytorch (99.7k stars)

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#PyTorchDistributed #DataParallelTraining #DistributedDeepLearning #LargeScaleModelTraining #AcceleratedMachineLearning
🔥 Continuous Audio Language Models

💡 The paper introduces Continuous Audio Language Models, a new approach to audio generation that addresses the limitations of traditional discrete audio language models. Discrete models represent audio as sequences of discrete tokens, which are extracted from lossy codecs with limited bitrate, resulting in a trade-off between audio quality and computational cost. To overcome this issue, the authors propose Continuous Audio Language Models, which instantiate a large Transformer backbone that produces a contextual embedding at every time step. This sequential information then conditions a multilayer perceptron to generate the next continuous frame of an audio Variational Autoencoder through consistency modeling. By avoiding lossy compression, Continuous Audio Language Models achieve higher quality at lower computational cost than their discrete counterparts. Experiments on speech and music demonstrate improved efficiency and fidelity over state-of-the-art discrete audio language models, facilitating lightweight, high-quality audio generation. The approach enables the generation of high-quality audio samples, which are made available for demonstration purposes. Overall, the paper contributes a novel method for continuous audio language modeling, which has the potential to improve the efficiency and quality of audio generation tasks.
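
The quality cost of a limited bitrate can be seen in a toy setting. The sketch below uses scalar uniform quantization as a deliberately simplified stand-in for codec tokenization (real audio codecs use learned vector quantizers) and measures how reconstruction error shrinks as more bits are spent per value. The signal and bit widths are arbitrary assumptions.

```python
import math

def quantize(value, bits, lo=-1.0, hi=1.0):
    """Uniformly quantize `value` in [lo, hi] to 2**bits levels and decode it back."""
    levels = 2 ** bits
    step = (hi - lo) / (levels - 1)
    index = round((value - lo) / step)
    index = max(0, min(levels - 1, index))
    return lo + index * step

def mean_abs_error(signal, bits):
    """Average absolute reconstruction error after a quantize/decode round trip."""
    return sum(abs(x - quantize(x, bits)) for x in signal) / len(signal)

# A toy one-dimensional "latent" trajectory; real audio latents are vectors.
signal = [math.sin(0.37 * t) for t in range(200)]
errors = {bits: mean_abs_error(signal, bits) for bits in (2, 4, 8)}
for bits, err in errors.items():
    print(f"{bits} bits per value: mean abs error {err:.4f}")
```

A discrete model must spend more tokens to reduce this error, whereas a model that predicts continuous frames directly never pays the quantization cost in the first place.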


📅 Published on Sep 8, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2509.06926
• PDF: https://arxiv.org/pdf/2509.06926
• Project Page: https://huggingface.co/spaces/kyutai/calm-samples
• GitHub: https://github.com/kyutai-labs/pocket-tts (4.2k stars)

🤖 Models citing this paper:
https://huggingface.co/kyutai/pocket-tts
https://huggingface.co/kyutai/pocket-tts-without-voice-cloning
https://huggingface.co/Verylicious/pocket-tts-ungated

🚀 Spaces citing this paper:
https://huggingface.co/spaces/D3vShoaib/pocket-tts
https://huggingface.co/spaces/kyutai/calm-samples
https://huggingface.co/spaces/Xlnk/tts

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AudioLanguageModels #ContinuousAudioGeneration #TransformerBackbone #AudioVariationalAutoencoders #MultilayerPerceptron
🔥 PDFMathTranslate: Scientific Document Translation Preserving Layouts

💡 The paper introduces PDFMathTranslate, an open-source software tool that translates scientific documents while preserving their original layouts. The problem addressed is that language barriers in scientific documents hinder the spread and development of science and technology, and previous translation efforts have largely ignored document layout. To solve this, the authors combine large language models with precise layout detection, leveraging recent advances in both areas to improve precision, flexibility, and efficiency. The key contribution is the software itself, which has been released to the community and has already drawn significant attention, with over 222,000 downloads. The results show that PDFMathTranslate translates scientific documents effectively while preserving their layouts, making it a valuable tool for the scientific community.
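
Layout-preserving translation can be sketched as a pass over detected blocks that rewrites only the text and leaves formulas and positions alone. The `Block` structure, the block labels, and `toy_translate` are hypothetical and do not reflect the tool's actual internals.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Block:
    kind: str      # "text" or "formula" (hypothetical layout-detector labels)
    bbox: tuple    # (x0, y0, x1, y1) in page coordinates
    content: str

def translate_page(blocks, translate):
    """Translate text blocks only; keep formulas and every bounding box unchanged."""
    return [
        replace(b, content=translate(b.content)) if b.kind == "text" else b
        for b in blocks
    ]

def toy_translate(text):
    # Placeholder for a language-model call (English to German, two words only).
    glossary = {"Theorem": "Satz", "Proof": "Beweis"}
    return " ".join(glossary.get(word, word) for word in text.split())

page = [
    Block("text", (72, 700, 540, 720), "Theorem 1"),
    Block("formula", (72, 660, 540, 690), r"\int_0^1 x^2 \, dx = \tfrac{1}{3}"),
    Block("text", (72, 630, 540, 650), "Proof omitted"),
]
translated = translate_page(page, toy_translate)
for block in translated:
    print(block.kind, block.bbox, block.content)
```

Keeping the bounding boxes fixed is what lets the translated text be written back into the same positions on the page.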


📅 Published on Jul 2, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2507.03009
• PDF: https://arxiv.org/pdf/2507.03009
• GitHub: https://github.com/byaidu/pdfmathtranslate (33.6k stars)

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ScientificDocumentTranslation #LanguageBarriersInScience #DocumentLayoutPreservation #MachineTranslationForScience #AcademicTextTranslation
🔥 Multi-module GRPO: Composing Policy Gradients and Prompt Optimization for Language Model Programs

💡 The paper introduces mmGRPO, a multi-module extension of Group Relative Policy Optimization (GRPO), to improve the accuracy of modular AI systems that combine multiple language model calls and prompts. The problem addressed is that existing methods, such as GRPO, are not effective for optimizing language models in modular systems where multiple tasks are performed. The authors propose mmGRPO, which groups language model calls by module and handles variable-length and interrupted trajectories. The method is composed with automatic prompt optimization to further improve accuracy. The results show that mmGRPO improves accuracy by 11% on average across various tasks, including classification, many-hop search, and privacy-preserving delegation, compared to post-trained language models. Additionally, mmGRPO outperforms prompt optimization alone by 5%. The authors have open-sourced mmGRPO as the dspy.GRPO optimizer, making it available for use in modular AI systems. Overall, the paper contributes a new method for optimizing language models in modular systems, which can lead to improved performance in a range of tasks.
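
Grouping calls by module can be sketched with the standard group-relative advantage, computed separately per module. This is a simplified illustration, assuming one call per module per rollout and that each call inherits its rollout's reward; the module names and rewards are invented, and the function is not the dspy.GRPO implementation.

```python
from collections import defaultdict
from statistics import mean, pstdev

def module_relative_advantages(calls, eps=1e-8):
    """calls: list of (module_name, rollout_id, reward) tuples, one per LM call.

    Rewards are normalized within each module's group, so a call is compared
    only against other calls made by the same module.
    """
    groups = defaultdict(list)
    for module, rollout_id, reward in calls:
        groups[module].append((rollout_id, reward))

    advantages = {}
    for module, items in groups.items():
        rewards = [r for _, r in items]
        mu, sigma = mean(rewards), pstdev(rewards)
        for rollout_id, reward in items:
            advantages[(module, rollout_id)] = (reward - mu) / (sigma + eps)
    return advantages

# Rollout 1 was interrupted before the "answer" module ran, so the groups
# have different sizes (a variable-length trajectory).
calls = [
    ("retrieve", 0, 1.0), ("retrieve", 1, 0.0), ("retrieve", 2, 1.0),
    ("answer", 0, 1.0), ("answer", 2, 1.0),
]
adv = module_relative_advantages(calls)
for key, value in sorted(adv.items()):
    print(key, round(value, 3))
```

Because normalization happens per module, an interrupted rollout still contributes a learning signal to the modules it did reach without distorting the statistics of the ones it skipped.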


📅 Published on Aug 6, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2508.04660
• PDF: https://arxiv.org/pdf/2508.04660
• Project Page: https://dspy.ai
• GitHub: https://github.com/stanfordnlp/dspy (34.2k stars)

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultiModuleLearning #LanguageModelOptimization #PolicyGradientMethods #ModularAISystems #PromptOptimizationTechniques
🔥 Adam's Law: Textual Frequency Law on Large Language Models

💡 The paper proposes a novel framework to improve large language model performance through textual frequency analysis. The authors argue that textual frequency, which is the frequency of certain words or phrases in a language, is relevant to human cognition and can also be applied to large language models. However, this topic has been understudied in the context of large language models.

The proposed framework consists of three main components. First, the authors introduce the Textual Frequency Law, which states that frequent textual data should be preferred for large language models, both for prompting and fine-tuning. Because many large language models do not disclose their training data, the authors estimate sentence-level frequency from online resources. They also use an input paraphraser to rewrite the input as a more frequent textual expression.

The second component is Textual Frequency Distillation, which involves querying large language models to conduct story completion by extending sentences in the datasets. The resulting corpora are used to adjust the initial estimation of textual frequency.

The third component is Curriculum Textual Frequency Training, which fine-tunes large language models in increasing order of sentence-level frequency. This means that the models are first trained on the least frequent sentences and then gradually moved to more frequent ones.
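
A curriculum in increasing order of sentence-level frequency can be sketched as a sort over estimated frequencies. The estimator here, a smoothed mean log word frequency over a tiny reference corpus, is an assumption chosen for brevity; the paper estimates frequency from online resources and refines it with distillation.

```python
import math
from collections import Counter

def sentence_frequency(sentence, word_counts, total):
    """Mean log relative frequency of the words in a sentence (add-one smoothed)."""
    words = sentence.lower().split()
    vocab = len(word_counts)
    return sum(
        math.log((word_counts[w] + 1) / (total + vocab)) for w in words
    ) / len(words)

def curriculum_order(sentences, reference_corpus):
    """Sort training sentences from least to most frequent."""
    word_counts = Counter(reference_corpus.lower().split())
    total = sum(word_counts.values())
    return sorted(sentences, key=lambda s: sentence_frequency(s, word_counts, total))

# Hypothetical reference corpus and training sentences.
reference = "the cat sat on the mat the dog sat on the rug"
sentences = ["the cat sat", "a zebra pondered", "the dog pondered"]
ordered = curriculum_order(sentences, reference)
print(ordered)
```

The same scoring function could also rank paraphrases of a prompt, which is how a frequency estimate would feed the input paraphraser described in the first component.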

The authors conducted experiments on a curated dataset called Textual Frequency Paired Dataset, which covers tasks such as math reasoning, machine translation, commonsense reasoning, and agentic tool calling. The results show that the proposed framework is effective in improving large language model performance.

Overall, the paper contributes to the understanding of textual frequency in large language models and provides a novel framework for improving their performance. The proposed framework has the potential to be applied to various natural language processing tasks and can lead to more efficient and effective large language models.


📅 Published on Apr 2

🔗 Links:
• arXiv: https://arxiv.org/abs/2604.02176
• PDF: https://arxiv.org/pdf/2604.02176
• GitHub: https://github.com/HongyuanLuke/frequencylaw (658 stars)

📊 Datasets citing this paper:
https://huggingface.co/datasets/Akaashiiii/TFPD

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AdamSLaw #TextualFrequencyAnalysis #LargeLanguageModels #NaturalLanguageProcessing #LanguageModelOptimization
🔥 RLDX-1 Technical Report

💡 The paper introduces RLDX-1, a general-purpose robotic policy for dexterous manipulation that addresses the limitations of existing vision-language-action models. These models have shown progress in human-like generalist robotic policies but struggle with complex real-world tasks that require broader functional capabilities such as motion awareness, memory-aware decision making, and physical sensing. To overcome this, RLDX-1 uses a Multi-Stream Action Transformer architecture that integrates heterogeneous modalities through modality-specific streams with cross-modal joint self-attention. This architecture is combined with system-level design choices including synthesizing training data for rare manipulation scenarios, learning procedures specialized for human-like manipulation, and inference optimizations for real-time deployment. The results show that RLDX-1 outperforms recent frontier vision-language-action models across both simulation benchmarks and real-world tasks, achieving success rates of 86.8 percent in ALLEX humanoid tasks compared to around 40 percent for other models. This positions RLDX-1 as a promising step toward reliable vision-language-action models for complex and dynamic real-world dexterous manipulation. The method and results demonstrate the ability of RLDX-1 to control a high-degree-of-freedom humanoid robot under diverse functional demands, highlighting its potential for complex real-world tasks.


📅 Published on May 5

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.03269
• PDF: https://arxiv.org/pdf/2605.03269
• Project Page: http://rlwrld.ai/rldx-1
• GitHub: https://github.com/RLWRLD/RLDX-1 (75 stars)

🤖 Models citing this paper:
https://huggingface.co/RLWRLD/RLDX-1-PT
https://huggingface.co/RLWRLD/RLDX-1-FT-ROBOCASA
https://huggingface.co/RLWRLD/RLDX-1-MT-ALLEX

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#RoboticManipulation #DexterousRobotics #VisionLanguageAction #MultiModalLearning #RobotPolicyLearning
🔥 PhysForge: Generating Physics-Grounded 3D Assets for Interactive Virtual World

💡 The paper introduces PhysForge, a system for generating interactive 3D assets that combines visual-language modeling with a physics-grounded diffusion model. The problem addressed is the lack of functional properties in existing methods for generating 3D assets, which focus on static geometry and overlook the need for interactive virtual worlds and embodied AI. To solve this, PhysForge uses a two-stage framework, first using a visual-language model to plan a hierarchical physical blueprint that defines material, functional, and kinematic constraints. Then, a physics-grounded diffusion model synthesizes high-fidelity geometry and precise kinematic parameters using a novel injection mechanism called KineVoxel Injection. The system is supported by PhysDB, a large-scale dataset of 150,000 assets with physical annotations. The results show that PhysForge produces functionally plausible and simulation-ready assets, providing a robust data engine for interactive 3D content and embodied agents. Overall, PhysForge contributes a new approach to generating physics-grounded 3D assets that can be used in interactive virtual worlds and embodied AI applications.
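
One way to picture a hierarchical physical blueprint is as a tree of parts, each carrying material, functional, and kinematic fields. The classes and field names below are hypothetical and only illustrate the kind of structure the first stage might plan; they are not the paper's schema.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Joint:
    kind: str      # e.g. "revolute" or "prismatic"
    axis: tuple    # direction of motion in the part's local frame
    limits: tuple  # (lower, upper) in radians or metres

@dataclass
class Part:
    name: str
    material: str
    function: str
    joint: Optional[Joint] = None          # None means rigidly attached
    children: list = field(default_factory=list)

def movable_parts(part):
    """Walk the hierarchy and return the names of parts that have a joint."""
    found = [part.name] if part.joint else []
    for child in part.children:
        found.extend(movable_parts(child))
    return found

# Hypothetical blueprint for a cabinet with one hinged door and a fixed shelf.
cabinet = Part(
    "cabinet", "wood", "storage",
    children=[
        Part("door", "wood", "access",
             joint=Joint("revolute", (0, 0, 1), (0.0, 1.57))),
        Part("shelf", "wood", "support"),
    ],
)
print(movable_parts(cabinet))
```

A second stage would then take such a blueprint as conditioning and generate geometry and joint parameters that satisfy it.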


📅 Published on May 6

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.05163
• PDF: https://arxiv.org/pdf/2605.05163
• Project Page: https://hku-mmlab.github.io/PhysForge/
• GitHub: https://github.com/HKU-MMLab/PhysForge (41 stars)

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#PhysicsGroundedModeling #InteractiveVirtualWorlds #3DAssetGeneration #EmbodiedAI #PhysicsBasedRendering