AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
🔥 Lance: Unified Multimodal Modeling by Multi-Task Synergy

💡 The paper introduces Lance, a unified multimodal model that combines understanding, generation, and editing capabilities for images and videos. The goal is to develop a model that can handle multiple tasks without relying on large model capacity or focusing on specific modalities like text or images. Lance achieves this through a dual-stream architecture and collaborative multi-task training, which enables joint context learning while separating the pathways for understanding and generation.

The model uses a mixture-of-experts architecture on shared multimodal sequences, allowing it to learn from both images and videos simultaneously. To address interference among different visual tokens, the model employs modality-aware rotary positional encoding, which helps to align tasks across different modalities.

During training, Lance uses a staged multi-task training paradigm with capability-oriented objectives and adaptive data scheduling. This approach strengthens both semantic comprehension and visual generation performance. The results show that Lance outperforms existing unified models in image and video generation while maintaining strong multimodal understanding capabilities.

Overall, Lance presents a practical approach to unified multimodal modeling, showing that collaborative multi-task training and a dual-stream architecture can improve performance across tasks without requiring large model capacity. It targets applications that need multimodal understanding, generation, and editing in a single model.


📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18678
• PDF: https://arxiv.org/pdf/2605.18678
• Project Page: https://lance-project.github.io/

🤖 Models citing this paper:
https://huggingface.co/bytedance-research/Lance

🚀 Spaces citing this paper:
https://huggingface.co/spaces/Nayefleb/Lance

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalModeling #MultitaskLearning #DualStreamArchitecture #MixtureOfExperts #UnifiedModelingApproach
🔥 LongLive-2.0: An NVFP4 Parallel Infrastructure for Long Video Generation

💡 The paper introduces LongLive-2.0, a parallel infrastructure for long video generation that addresses training and inference bottlenecks. The problem with existing methods is that they are slow and require a lot of memory, especially for long videos. To solve this, the authors propose a sequence-parallel autoregressive training method called Balanced SP, which pairs clean-history and noisy-target temporal chunks on each rank, enabling efficient teacher-forcing and reducing GPU memory cost.

The method also uses NVFP4 precision to accelerate GEMM computation during training. Additionally, the authors tune a diffusion model into a long, multi-shot, interactive auto-regressive diffusion model, which can be converted to real-time generation with standalone LoRA weights. For inference, the authors enable W4A4 NVFP4 inference, quantize KV cache into NVFP4 for memory savings, and boost end-to-end throughput with asynchronous streaming VAE decoding.
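
To make the NVFP4 idea concrete, here is a minimal sketch of block-scaled 4-bit quantization in plain Python. It assumes the publicly described NVFP4 layout: 4-bit E2M1 values with one scale per 16-element block. The function names are hypothetical, the scale is kept in full precision (real NVFP4 stores it in FP8 and adds a per-tensor scale), and nothing here is taken from the LongLive-2.0 code.

```python
from typing import List

# Magnitudes representable by a 4-bit E2M1 float; the sign is a separate bit.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # elements that share one scale factor (assumed NVFP4 block size)


def fake_quant_block(block: List[float]) -> List[float]:
    """Quantize then dequantize one block, simulating the precision loss."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Map the largest magnitude in the block onto the top grid value (6.0).
    # Assumption: the scale stays in full precision in this sketch.
    scale = amax / E2M1_GRID[-1]
    out = []
    for v in block:
        # Snap the scaled magnitude to the nearest representable grid point.
        mag = min(E2M1_GRID, key=lambda g: abs(abs(v) / scale - g))
        out.append(mag * scale if v >= 0 else -mag * scale)
    return out


def fake_quant_fp4(values: List[float]) -> List[float]:
    """Apply block-wise fake quantization to a flat list of floats."""
    out: List[float] = []
    for start in range(0, len(values), BLOCK):
        out.extend(fake_quant_block(values[start:start + BLOCK]))
    return out
```

Because each block carries its own scale, one outlier only degrades the 16 values around it rather than the whole tensor, which is the usual argument for block scaling at such low bit widths.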

The results show that LongLive-2.0 achieves up to 2.15x speedup in training and 1.84x in inference. The LongLive-2.0-5B model reaches 45.7 FPS at inference while attaining strong performance on benchmarks. The authors claim that LongLive-2.0 is the first NVFP4 training and inference system for long video generation. Overall, the paper presents a parallel infrastructure that addresses the speed and memory bottlenecks of long video generation and makes real-time generation of high-quality video feasible.


📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18739
• PDF: https://arxiv.org/pdf/2605.18739
• Project Page: https://nvlabs.github.io/LongLive/LongLive2/

🤖 Models citing this paper:
https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B
https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S4
https://huggingface.co/Efficient-Large-Model/LongLive-2.0-5B-NVFP4-S2

📊 Datasets citing this paper:
https://huggingface.co/datasets/Efficient-Large-Model/LongLive2.0-Toy-Dataset

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LongVideoGeneration #ParallelInfrastructure #NVFP4 #AutoregressiveTraining #DiffusionModeling
🔥 AI for Auto-Research: Roadmap & User Guide

💡 The paper, "AI for Auto-Research: Roadmap & User Guide," examines the role of artificial intelligence in the research process, highlighting both its potential and its limitations. The authors note that while AI systems can excel at structured tasks such as data analysis and paper writing, they often struggle with novel ideas, scientific judgment, and research-level experiments, so human oversight is still required for credible outcomes.

To investigate this further, the authors conducted an end-to-end analysis of AI across the entire research lifecycle, dividing it into four phases: Creation, Writing, Validation, and Dissemination. They found that AI is reliable in tasks that are structured, retrieval-grounded, and tool-mediated, but fragile when it comes to genuinely novel ideas and scientific judgment.

The study reveals that generated ideas often degrade after implementation, research code lags behind pattern-matching benchmarks, and end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. Moreover, the authors show that greater automation can obscure rather than eliminate failure modes, making human-governed collaboration the most credible deployment paradigm.

The paper provides several contributions, including a structured taxonomy, benchmark suite, and tool inventory, as well as cross-stage design principles and a practitioner-oriented playbook. The authors also maintain a project page with resources for further exploration. Overall, the study highlights the importance of human-AI collaboration in research, emphasizing that while AI can be a powerful tool, it is not yet ready to replace human scientists and researchers.


📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18661
• PDF: https://arxiv.org/pdf/2605.18661
• Project Page: https://worldbench.github.io/awesome-ai-auto-research

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ArtificialIntelligenceInResearch #AutoResearchTechnologies #AIForScientificDiscovery #MachineLearningInAcademia #ResearchProcessAutomation
🔥 Adaptive Chunking: Optimizing Chunking-Method Selection for RAG

💡 The paper introduces Adaptive Chunking, a framework that optimizes chunking-method selection for Retrieval-Augmented Generation (RAG) using intrinsic document metrics. The effectiveness of RAG depends on how documents are segmented into smaller units, and one-size-fits-all chunking often fails to capture the nuances of diverse texts.

To address this, the framework selects the most suitable chunking strategy for each document based on five novel metrics that assess chunking quality: References Completeness, Intrachunk Cohesion, Document Contextual Coherence, Block Integrity, and Size Compliance. The authors also introduce two new chunkers and targeted post-processing techniques to support the framework.

The results show that the adaptive method significantly improves downstream RAG performance, increasing answer correctness to 72% and the number of successfully answered questions by over 30%, without changing models or prompts. The code for the framework is available. Overall, the paper shows that adaptive, document-aware chunking guided by intrinsic metrics offers a practical path to more robust RAG systems.
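
The per-document selection step can be sketched as "run every candidate chunker, score its output with intrinsic metrics, keep the best." The snippet below is only an illustration under assumptions: the two chunkers, the `size_compliance` definition (fraction of chunks within a length range), and the unweighted averaging are hypothetical stand-ins, not the paper's chunkers or metric formulas.

```python
from typing import Callable, Dict, List

Chunker = Callable[[str], List[str]]
Metric = Callable[[str, List[str]], float]


def fixed_size(n: int) -> Chunker:
    """Hypothetical baseline chunker: split every n characters."""
    return lambda doc: [doc[i:i + n] for i in range(0, len(doc), n)]


def by_paragraph(doc: str) -> List[str]:
    """Hypothetical chunker: split on blank lines."""
    return [p for p in doc.split("\n\n") if p.strip()]


def size_compliance(lo: int, hi: int) -> Metric:
    """Illustrative metric: share of chunks whose length lies in [lo, hi]."""
    def metric(doc: str, chunks: List[str]) -> float:
        if not chunks:
            return 0.0
        return sum(lo <= len(c) <= hi for c in chunks) / len(chunks)
    return metric


def select_chunker(doc: str,
                   chunkers: Dict[str, Chunker],
                   metrics: List[Metric]) -> str:
    """Return the name of the chunker with the highest mean metric score."""
    def score(name: str) -> float:
        chunks = chunkers[name](doc)
        return sum(m(doc, chunks) for m in metrics) / len(metrics)
    return max(chunkers, key=score)


doc = "a" * 50 + "\n\n" + "b" * 60
chunkers = {"fixed_200": fixed_size(200), "paragraph": by_paragraph}
best = select_chunker(doc, chunkers, [size_compliance(40, 80)])
print(best)  # paragraph
```

The selection needs no retrieval queries or labels, only the document and its candidate chunkings, which is what makes the metrics "intrinsic."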


📅 Published on Mar 26

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2603.25333
• PDF: https://arxiv.org/pdf/2603.25333

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AdaptiveChunking #RetrievalAugmentedGeneration #ChunkingMethodOptimization #DocumentSegmentationTechniques #RAGModelImprovements
🔥 Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

💡 The paper proposes a novel framework called Code-as-Room for generating 3D indoor rooms from top-down view images. The problem addressed is the difficulty in designing realistic and functional 3D indoor rooms, which is essential for various applications such as interior design, virtual reality, and gaming. Existing methods that use text-based descriptions or reference images struggle to capture precise spatial information and suffer from instability and infinite looping when tasked with holistic room generation.

The proposed method, Code-as-Room, uses a multilayer language model-based agentic framework with a structured execution harness to generate executable Blender code from top-down images. The framework parses the reference image to extract scene elements and their spatial relationships and synthesizes code for geometry, materials, and lighting in a multi-stage pipeline. A cross-stage memory module is used to maintain context and mitigate context forgetting.

The results show that the proposed framework is effective in generating 3D rooms from top-down images. A dedicated benchmark for code-based 3D room synthesis is introduced, which encompasses various evaluation protocols. Comprehensive comparisons against existing agent-based methods are conducted, validating the effectiveness of the proposed execution harness. The paper contributes to the field by providing a principled approach to 3D room synthesis from top-down views, addressing the limitations of existing methods and demonstrating the potential of using executable code as a representation for 3D rooms.


📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18451
• PDF: https://arxiv.org/pdf/2605.18451
• Project Page: https://code-as-room.github.io/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#CodeAsRoom #3DRoomGeneration #AgenticCodeSynthesis #IndoorSceneUnderstanding #ArchitectureGeneration
🔥 SkillsVote: Lifecycle Governance of Agent Skills from Collection, Recommendation to Evolution

💡 The paper introduces SkillsVote, a governance framework for managing reusable skills in long-horizon large language model agents. The problem addressed is that raw trajectories of agent experiences are noisy and hard to govern, making it difficult to reuse and improve agent skills. To solve this, the authors propose treating agent skills as an experience schema that combines executable scripts with non-executable guidance on procedures.

The SkillsVote framework consists of three main processes: collection, recommendation, and evolution of agent skills. It starts by profiling a large open-source corpus of skills to identify environment requirements, quality, and verifiability. Then, it synthesizes tasks for verifiable skills and performs a search over a structured skill library to provide instructional context before execution. After execution, it decomposes trajectories into skill-linked subtasks, attributes outcomes to skill use, and admits only successful reusable discoveries to updates.

The evaluation of SkillsVote shows promising results, with offline evolution improving performance on Terminal-Bench 2.0 by up to 7.9 percentage points and online evolution improving performance on SWE-Bench Pro by up to 2.6 percentage points. The key contribution of the paper is that governed external skill libraries can improve frozen agents without requiring model updates, as long as systems control exposure, credit, and preservation of skills. Overall, the SkillsVote framework provides a structured approach to managing and improving agent skills, enabling more efficient and effective reuse of experience and knowledge.


📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18401
• PDF: https://arxiv.org/pdf/2605.18401
• Project Page: https://skills.vote

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgentGovernance #LargeLanguageModels #SkillEvolution #ReusableSkills #LifecycleManagement
🔥 AstraFlow: Dataflow-Oriented Reinforcement Learning for Agentic LLMs

💡 The paper introduces AstraFlow, a dataflow-oriented reinforcement learning system designed to improve the efficiency and scalability of large language model agents. The problem addressed is that current reinforcement learning systems are prohibitively expensive and struggle to support complex workloads, such as multi-policy collaborative training, while efficiently using diverse compute resources.

The authors propose AstraFlow as a solution, which replaces conventional trainer-centered control with principled component abstractions. In AstraFlow, rollout services, dataflow management, and training are decoupled into autonomous components, allowing the system to natively support complex multi-policy agentic RL workloads and efficiently exploit diverse compute resources.
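
A toy sketch of the decoupling idea: rollout services publish trajectories into a dataflow layer, and each trainer pulls only its own policy's data. All class and method names below are hypothetical and in-process; this is not AstraFlow's API, which the post does not describe.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from typing import Callable, Deque, Dict, List


@dataclass
class Trajectory:
    policy: str
    tokens: List[str]
    reward: float


class DataflowManager:
    """Buffers trajectories per policy; knows nothing about rollout or training."""

    def __init__(self) -> None:
        self._queues: Dict[str, Deque[Trajectory]] = defaultdict(deque)

    def publish(self, traj: Trajectory) -> None:
        self._queues[traj.policy].append(traj)

    def take_batch(self, policy: str, size: int) -> List[Trajectory]:
        q = self._queues[policy]
        return [q.popleft() for _ in range(min(size, len(q)))]


class RolloutService:
    """Generates trajectories for one policy and hands them to the dataflow layer."""

    def __init__(self, policy: str,
                 generate: Callable[[str, str], Trajectory],
                 sink: DataflowManager) -> None:
        self.policy, self.generate, self.sink = policy, generate, sink

    def run(self, prompts: List[str]) -> None:
        for prompt in prompts:
            self.sink.publish(self.generate(self.policy, prompt))


class Trainer:
    """Consumes batches for one policy; never calls a rollout service directly."""

    def __init__(self, policy: str, source: DataflowManager) -> None:
        self.policy, self.source, self.steps = policy, source, 0

    def step(self, batch_size: int) -> int:
        batch = self.source.take_batch(self.policy, batch_size)
        if batch:
            self.steps += 1  # a real trainer would compute a gradient update here
        return len(batch)
```

In this shape, adding a second policy means adding another rollout service and trainer pair; the buffer in the middle does not change.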

The results show that AstraFlow supports multi-policy training, elastic scaling, heterogeneous cross-region execution, and composable data algorithms without requiring system-level code changes. The system achieves comparable or better accuracy than existing RL systems while delivering a 2.7x training speedup in multi-policy collaborative training. The evaluation covers math, code, search, and AgentBench workloads, demonstrating the system's versatility and efficiency.

Overall, AstraFlow's contributions include its ability to efficiently support complex workloads, scale to large language model agents, and provide a principled abstraction for reinforcement learning system components, making it a significant advancement in the field of reinforcement learning for large language models.


📅 Published on May 15

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.15565
• PDF: https://arxiv.org/pdf/2605.15565
• Project Page: https://infini-ai-lab.github.io/astraflow/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DataflowOrientedRL #ReinforcementLearningForLLMs #AgenticLanguageModels #LargeLanguageModelAgents #ScalableRLSystems
🔥 SAM 3: Segment Anything with Concepts

💡 The paper introduces Segment Anything Model 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts. The model achieves state-of-the-art performance in promptable concept segmentation and tracking by leveraging a unified model architecture with decoupled recognition and localization. The concept prompts can be short noun phrases, image exemplars, or a combination of both, and the model returns segmentation masks and unique identities for all matching object instances.

To advance promptable concept segmentation, the authors built a scalable data engine that produces a high-quality dataset with 4 million unique concept labels, including hard negatives, across images and videos. The model consists of an image-level detector and a memory-based video tracker that share a single backbone. The recognition and localization are decoupled with a presence head, which boosts detection accuracy.
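
One way to read "decoupled with a presence head": a single image-level score answers "is this concept present at all?", and each candidate's own score only has to answer "is this the right region, given that it is present?". The sketch below multiplies the two probabilities. The function names and the plain product are assumptions for illustration, not Meta's released implementation.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def combine_scores(presence_logit: float, query_logits: List[float]) -> List[float]:
    """Final per-candidate score = P(concept present) * P(candidate matches | present)."""
    presence = sigmoid(presence_logit)
    return [presence * sigmoid(q) for q in query_logits]


# Concept judged absent: every candidate is suppressed, even a confident one.
print(combine_scores(-10.0, [5.0, -5.0]))
# Concept judged present: candidates are ranked by their own localization score.
print(combine_scores(10.0, [5.0, -5.0]))
```

The effect is that a well-localized but wrong object (for example, a confident box on a dog when the prompt is "cat") cannot score highly unless the global presence score agrees.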

The results show that Segment Anything Model 3 doubles the accuracy of existing systems in both image and video promptable concept segmentation, and improves previous capabilities on visual segmentation tasks. The authors also open source Segment Anything Model 3 along with a new benchmark for promptable concept segmentation, called Segment Anything with Concepts.

The main contributions of the paper are the introduction of a unified model architecture that achieves state-of-the-art performance in promptable concept segmentation and tracking, the creation of a large-scale dataset with unique concept labels, and the development of a new benchmark for evaluating promptable concept segmentation models. Overall, the paper presents a significant advancement in the field of computer vision and object segmentation, enabling more accurate and efficient detection, segmentation, and tracking of objects in images and videos based on concept prompts.


📅 Published on Nov 20, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2511.16719
• PDF: https://arxiv.org/pdf/2511.16719
• Project Page: https://ai.meta.com/sam3/

🤖 Models citing this paper:
https://huggingface.co/AllanVester/SAM3.1-CoreML-FP16
https://huggingface.co/AllanVester/SAM3.1-CoreML
https://huggingface.co/embedl/sam3

🚀 Spaces citing this paper:
https://huggingface.co/spaces/kith777/rag_agent

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ComputerVision #ObjectSegmentation #ConceptLearning #ImageTracking #PromptableSegmentation
🔥 PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

💡 The paper addresses the problem of visual typesetting optimization for scientific documents, which involves transforming a compilable LaTeX paper into a visually polished and page-budget-compliant PDF. The authors argue that existing methods, such as rule-based tools and text-only language models, are insufficient because they operate only on source code and log files, and are unable to predict or verify the two-dimensional layout consequences of their changes.

To solve this problem, the authors introduce a vision-in-the-loop agent called PaperFit, which iteratively renders pages, diagnoses defects, and applies constrained repairs. The authors also formalize the problem as Visual Typesetting Optimization, and introduce a five-category taxonomy of typesetting defects to guide diagnosis.
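
The render, diagnose, repair cycle can be sketched as a small control loop. Everything below is a hypothetical stand-in: `render`, `diagnose`, and `repair` are caller-supplied functions, and the toy document is a dict with a figure scale rather than real LaTeX. The paper's agent renders actual pages and inspects them with a vision model.

```python
from typing import Callable, Dict, List

Source = Dict[str, float]


def vision_in_the_loop(source: Source,
                       render: Callable[[Source], int],
                       diagnose: Callable[[int], List[str]],
                       repair: Callable[[Source, List[str]], Source],
                       max_iters: int = 5) -> Source:
    """Repeat render -> diagnose -> repair until no defects remain or the budget runs out."""
    for _ in range(max_iters):
        rendered = render(source)      # stand-in for compiling and rasterizing pages
        defects = diagnose(rendered)   # stand-in for visual defect detection
        if not defects:
            break
        source = repair(source, defects)
    return source


PAGE_BUDGET = 10

# Toy model: page count grows with figure scale.
render = lambda s: 8 + round(4 * s["figure_scale"])
diagnose = lambda pages: ["over_page_budget"] if pages > PAGE_BUDGET else []
repair = lambda s, defects: {**s, "figure_scale": s["figure_scale"] * 0.5}

fixed = vision_in_the_loop({"figure_scale": 1.0}, render, diagnose, repair)
print(fixed)  # {'figure_scale': 0.5}
```

The key property is that each repair is checked against the rendered result rather than against the source or compile log, so a change that fails to fix the layout triggers another iteration.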

To evaluate PaperFit, the authors construct a benchmark called PaperFit-Bench, which consists of 200 papers across 10 venue templates and 13 defect types at different difficulty levels. Extensive experiments show that PaperFit outperforms all baselines by a large margin, demonstrating the effectiveness of vision-in-the-loop typesetting optimization.

The authors conclude that bridging the gap from compilable source to publication-ready PDF requires vision-in-the-loop optimization, and that Visual Typesetting Optimization constitutes a critical missing stage in the document automation pipeline. Overall, the paper contributes a new approach to visual typesetting optimization, a benchmark for evaluating VTO methods, and a demonstration of the importance of vision-in-the-loop optimization for producing high-quality scientific documents.


📅 Published on May 11

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.10341
• PDF: https://arxiv.org/pdf/2605.10341

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VisionInLoopTypesetting #ScientificDocumentOptimization #LaTeXTypesetting #DocumentLayoutOptimization #TypesettingAutomation
🔥 EnvFactory: Scaling Tool-Use Agents via Executable Environments Synthesis and Robust RL

💡 The paper introduces EnvFactory, a framework that automates the creation of executable tool environments and natural multi-turn trajectories for training large language models with agentic reinforcement learning. The problem addressed is that current approaches to equip large language models with tool-use capabilities are limited by the lack of scalable and robust execution environments and the scarcity of realistic training data. Existing methods rely on costly real-world APIs, simulators that are prone to hallucination, or synthetic environments that are often single-turn or based on pre-collected documents.

EnvFactory addresses these challenges by autonomously exploring and verifying stateful, executable tool environments from authentic resources, and synthesizing natural multi-turn trajectories through topology-aware sampling and calibrated refinement. This approach produces grounded queries with implicit intents, which are more effective for reinforcement learning training.

The method involves using a fully automated framework to generate environments and trajectories. The results show that using only 85 verified environments across 7 domains, EnvFactory generates a large number of trajectories, achieving superior training efficiency and downstream performance. The framework improves the performance of Qwen3-series models by up to 15 percent on certain benchmarks, and by up to 8.6 percent and 6 percent on other conversational benchmarks.

The paper's contributions are that EnvFactory provides a scalable, extensible, and robust foundation for agentic reinforcement learning and achieves superior performance with fewer resources than prior work. The framework could help apply large language models to real-world tool-use problems.


📅 Published on May 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.18703
• PDF: https://arxiv.org/pdf/2605.18703

🤖 Models citing this paper:
https://huggingface.co/LARK-Lab/EnvFactory-1.7B
https://huggingface.co/LARK-Lab/EnvFactory-4B
https://huggingface.co/LARK-Lab/EnvFactory-8B

📊 Datasets citing this paper:
https://huggingface.co/datasets/LARK-Lab/EnvFactory-SFT-ALL
https://huggingface.co/datasets/LARK-Lab/EnvFactory-SFT-FILTERED
https://huggingface.co/datasets/LARK-Lab/EnvFactory-RL

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ExecutableEnvironments #ToolUseAgents #AgenticReinforcementLearning #RobustRL #LanguageModelTraining