AI & ML Papers
32.8K subscribers
7.05K photos
519 videos
24 files
7.71K links
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
🔥 SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

💡 The paper introduces SmolDocling, a compact vision-language model designed for end-to-end document conversion. The model aims to process entire pages and generate a new universal markup format called DocTags, which captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models or ensemble solutions, SmolDocling offers a single end-to-end conversion model with 256M parameters. This approach allows for accurately capturing content, structure, and spatial location of document elements.

The model is trained to reproduce document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types, including business documents, academic papers, technical reports, patents, and forms. The authors also contribute novel publicly sourced datasets for charts, tables, equations, and code recognition.

Experimental results show that SmolDocling performs competitively with vision-language models up to 27 times its size while substantially reducing computational requirements. The authors plan to make the model and datasets publicly available, which will facilitate further research in this area. Overall, SmolDocling offers an efficient single-model solution for end-to-end document conversion.


📅 Published on Mar 14, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2503.11576
• PDF: https://arxiv.org/pdf/2503.11576
• Project Page: https://huggingface.co/ds4sd/SmolDocling-256M-preview
• GitHub: https://github.com/docling-project/docling 59.1k

🤖 Models citing this paper:
https://huggingface.co/docling-project/SmolDocling-256M-preview
https://huggingface.co/ibm-granite/granite-docling-258M
https://huggingface.co/docling-project/CodeFormulaV2

📊 Datasets citing this paper:
https://huggingface.co/datasets/mnezama/SynthCodeNet
https://huggingface.co/datasets/docling-project/SynthCodeNet
https://huggingface.co/datasets/HuggingFaceM4/DoclingMatix

🚀 Spaces citing this paper:
https://huggingface.co/spaces/ibm-granite/granite-docling-258m-demo
https://huggingface.co/spaces/ibm-granite/granite-docling-258M-WebGPU
https://huggingface.co/spaces/jairwaal/image

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DocumentConversion #VisionLanguageModel #MultimodalProcessing #EndToEndLearning #DocumentUnderstanding
🔥 OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

💡 The paper introduces OpenDevin, a platform for developing AI agents that interact with the world the way human developers do: by writing code, using a command line, and browsing the web. The platform supports multiple agents and includes evaluation benchmarks.

The platform is designed to support the implementation of new agents, safe interaction with sandboxed environments for code execution, and coordination between multiple agents, with built-in benchmarks for assessing agent performance. OpenDevin is a community effort with contributions from over 160 contributors, and it is released under the permissive MIT license.

The paper evaluates agents on 15 challenging tasks, including software engineering and web browsing. Because the project is community-driven, the authors expect the platform to keep improving through ongoing contributions.


📅 Published on Jul 23, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2407.16741
• PDF: https://arxiv.org/pdf/2407.16741
• GitHub: https://github.com/opendevin/opendevin 72.6k

📊 Datasets citing this paper:
https://huggingface.co/datasets/GloriaaaM/LLM-Agent-Harness-Survey
https://huggingface.co/datasets/namanvats/harbor-goose-openhands-benchmark
https://huggingface.co/datasets/antieval/aware-bench-trajectories

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ArtificialIntelligenceAgents #AIPlatformDevelopment #GeneralistAgents #HumanCentricAI #AIEnvironmentInteraction
🔥 PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

💡 The paper proposes PaddleOCR-VL, a state-of-the-art, resource-efficient model for document parsing. It addresses the need to accurately recognize document elements such as text, tables, formulas, and charts while keeping resource consumption low.

The core model, PaddleOCR-VL-0.9B, is a compact vision-language model that combines a NaViT-style dynamic-resolution visual encoder with the ERNIE language model. It supports 109 languages and recognizes complex elements with high accuracy.

The results show that PaddleOCR-VL achieves state-of-the-art performance in both page-level document parsing and element-level recognition, outperforming existing solutions and remaining strongly competitive with top-tier vision-language models. It also delivers fast inference, which makes it well suited to practical deployment. The code is publicly available.


📅 Published on Oct 16, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2510.14528
• PDF: https://arxiv.org/pdf/2510.14528
• GitHub: https://github.com/PaddlePaddle/PaddleOCR 77.1k

🤖 Models citing this paper:
https://huggingface.co/PaddlePaddle/PaddleOCR-VL
https://huggingface.co/PaddlePaddle/PP-DocLayoutV2
https://huggingface.co/unsloth/PaddleOCR-VL

📊 Datasets citing this paper:
https://huggingface.co/datasets/proxectonos/corpus_dominio_cientifico

🚀 Spaces citing this paper:
https://huggingface.co/spaces/PaddlePaddle/PaddleOCR-VL_Online_Demo
https://huggingface.co/spaces/eduagarcia/multilingual-tokenizer-leaderboard
https://huggingface.co/spaces/waytoAGI/PaddleOCR-VL_Online_Demo

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultilingualDocumentParsing #VisionLanguageModels #DocumentAnalysis #TableRecognition #MultimodalLearning
🔥 LTX-2: Efficient Joint Audio-Visual Foundation Model

💡 The paper introduces LTX-2, an open-source audiovisual diffusion model that generates synchronized video and audio content. The problem addressed is that current text-to-video diffusion models can generate compelling video sequences but lack the semantic, emotional, and atmospheric cues that audio provides. To solve this, the authors propose a dual-stream transformer architecture with cross-modal attention and classifier-free guidance. The model consists of a 14 billion parameter video stream and a 5 billion parameter audio stream, coupled through bidirectional audio-video cross-attention layers. This architecture enables efficient training and inference of a unified audiovisual model, with more capacity allocated for video generation than audio generation.

The method used to achieve this includes employing a multilingual text encoder for broader prompt understanding and introducing a modality-aware classifier-free guidance mechanism for improved audiovisual alignment and controllability. The model is trained using a combination of video and audio data, with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning.
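The post does not spell out the modality-aware guidance rule, so the following is only a minimal sketch of the standard classifier-free guidance update that such a mechanism builds on. The per-stream scales in `guide_streams` are an illustrative assumption and are not taken from the LTX-2 paper; all function names are hypothetical.

```python
def cfg(uncond, cond, scale):
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def guide_streams(video, audio, video_scale, audio_scale):
    """Apply the same update to each stream with its own guidance scale.

    ASSUMPTION: separate per-stream scales are illustrative only; the actual
    modality-aware rule used by LTX-2 is not described in the post.
    """
    return {
        "video": cfg(video["uncond"], video["cond"], video_scale),
        "audio": cfg(audio["uncond"], audio["cond"], audio_scale),
    }


if __name__ == "__main__":
    # Toy two-element "predictions" standing in for denoiser outputs.
    video = {"uncond": [0.0, 1.0], "cond": [1.0, 3.0]}
    audio = {"uncond": [0.5, 0.5], "cond": [1.5, 0.5]}
    print(guide_streams(video, audio, video_scale=2.0, audio_scale=1.0))
```

With a scale of 1.0 the update returns the conditional prediction unchanged; larger scales push the result further away from the unconditional prediction.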

The results show that LTX-2 achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. The model is capable of generating high-quality, temporally synchronized audiovisual content, including rich and coherent audio tracks that follow the characters, environment, style, and emotion of each scene. The model weights and code are publicly released, making it accessible for further research and development. Overall, LTX-2 provides a significant contribution to the field of audiovisual generation, enabling the creation of more realistic and engaging multimedia content.


📅 Published on Jan 6, 2026

🔗 Links:
• arXiv: https://arxiv.org/abs/2601.03233
• PDF: https://arxiv.org/pdf/2601.03233
• Project Page: https://app.ltx.studio/ltx-2-playground/i2v
• GitHub: https://github.com/Lightricks/LTX-2 6.4k

🤖 Models citing this paper:
https://huggingface.co/Lightricks/LTX-2
https://huggingface.co/Lightricks/LTX-2.3
https://huggingface.co/unsloth/LTX-2.3-GGUF

🚀 Spaces citing this paper:
https://huggingface.co/spaces/linoyts/LTX-2-3-First-Last-Frame
https://huggingface.co/spaces/linoyts/LTX-2-3-sync
https://huggingface.co/spaces/linoyts/LTX-2-3-outpaint

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AudioVisualLearning #MultimodalDiffusionModels #CrossModalAttention #AudioVideoGeneration #JointFoundationModels
🔥 LightRAG: Simple and Fast Retrieval-Augmented Generation

💡 The paper introduces LightRAG, an approach to improving Retrieval-Augmented Generation (RAG) systems, which enhance large language models by integrating external knowledge sources. Existing systems rely on flat data representations and have limited contextual awareness, which leads to fragmented answers that fail to capture complex inter-dependencies.

To address this, LightRAG incorporates graph structures into text indexing and retrieval. A dual-level retrieval system covers both low-level and high-level knowledge discovery, and combining graph structures with vector representations allows efficient retrieval of related entities and their relationships, improving response times while maintaining contextual relevance. An incremental update algorithm integrates new data promptly, so the system stays effective in rapidly changing data environments.

Experiments show considerable improvements in retrieval accuracy and efficiency over existing approaches. The authors have released LightRAG as open source.
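To make the dual-level idea concrete, here is a toy sketch, not LightRAG's actual implementation or API. It assumes a tiny hand-written graph and uses exact keyword matching as a stand-in for the vector search the summary mentions; every name and data item below is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relation:
    source: str
    target: str
    themes: frozenset  # high-level keywords describing the relation


# Hypothetical toy knowledge graph: entities are nodes, relations are edges.
ENTITIES = {
    "bee": "An insect that pollinates flowering plants.",
    "orchard": "A plot of land planted with fruit trees.",
    "pesticide": "A chemical used to kill crop pests.",
}
RELATIONS = [
    Relation("bee", "orchard", frozenset({"pollination", "agriculture"})),
    Relation("pesticide", "bee", frozenset({"environmental impact"})),
]


def low_level(keywords):
    """Entity-centric lookup: matched entities plus their one-hop neighbours."""
    hits = {k for k in keywords if k in ENTITIES}
    neighbours = set()
    for r in RELATIONS:
        if r.source in hits:
            neighbours.add(r.target)
        if r.target in hits:
            neighbours.add(r.source)
    return sorted(hits | neighbours)


def high_level(keywords):
    """Theme-centric lookup: relations whose themes overlap the query keywords."""
    wanted = set(keywords)
    return [(r.source, r.target) for r in RELATIONS if r.themes & wanted]


def dual_level(low_keywords, high_keywords):
    """Combine both retrieval levels into one context for the generator."""
    return {
        "entities": low_level(low_keywords),
        "relations": high_level(high_keywords),
    }


if __name__ == "__main__":
    print(dual_level(["pesticide"], ["environmental impact"]))
```

The low-level path answers questions about specific entities, while the high-level path surfaces broader themes that span several entities; a real system would extract both keyword sets from the query with a language model.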


📅 Published on Oct 8, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2410.05779
• PDF: https://arxiv.org/pdf/2410.05779
• GitHub: https://github.com/hkuds/lightrag 34.7k
• Project Page: https://huggingface.co/Neha12210/project2-advanced-rag

🤖 Models citing this paper:
https://huggingface.co/muthuk1/graphrag-inference-hackathon
https://huggingface.co/atad-tokyo/GST_LIVING_NOVEL
https://huggingface.co/Neha12210/project2-advanced-rag

🚀 Spaces citing this paper:
https://huggingface.co/spaces/rm-lht/lightrag

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#RetrievalAugmentedGeneration #GraphBasedInformationRetrieval #KnowledgeDiscoverySystems #LargeLanguageModels #TextIndexingTechniques
🔥 RF-DETR: Neural Architecture Search for Real-Time Detection Transformers

💡 The paper introduces RF-DETR, a lightweight detection transformer that uses neural architecture search to optimize accuracy and latency for real-time object detection. The motivation is that current state-of-the-art detectors often fail to generalize to real-world datasets with classes not seen during pre-training.

Instead of fine-tuning a large vision-language model, RF-DETR fine-tunes a pre-trained base network on the target dataset and uses weight-sharing neural architecture search to evaluate thousands of network configurations, looking for the best accuracy-latency tradeoff and improving transferability to diverse target domains.

The results show that RF-DETR significantly outperforms prior state-of-the-art real-time methods on COCO and Roboflow100-VL. A small RF-DETR configuration achieves 48.0 AP on COCO, beating a comparable method by 5.3 AP at similar latency, and a larger configuration outperforms another method on Roboflow100-VL while running 20 times as fast. The larger configuration is also reported as the first real-time detector to surpass 60 AP on COCO. The code is publicly available.
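Choosing "the best accuracy-latency tradeoff" among many evaluated configurations can be illustrated with a Pareto-front filter. This is a generic sketch under assumed inputs, not the RF-DETR search procedure itself; the configuration names and numbers are made up for illustration.

```python
def pareto_front(configs):
    """Keep configurations that no faster-or-equal configuration beats on AP.

    Each config is a (name, latency_ms, ap) tuple.
    """
    front = []
    best_ap = float("-inf")
    # Sort by latency ascending; at equal latency, higher AP comes first.
    for name, latency, ap in sorted(configs, key=lambda c: (c[1], -c[2])):
        if ap > best_ap:
            front.append((name, latency, ap))
            best_ap = ap
    return front


def best_under_budget(configs, max_latency_ms):
    """Most accurate Pareto-optimal configuration within a latency budget."""
    feasible = [c for c in pareto_front(configs) if c[1] <= max_latency_ms]
    return max(feasible, key=lambda c: c[2]) if feasible else None


if __name__ == "__main__":
    # HYPOTHETICAL measurements: (name, latency in ms, AP).
    candidates = [
        ("a", 2.0, 40.0),
        ("b", 3.0, 39.0),  # slower and less accurate than "a": dominated
        ("c", 4.0, 45.0),
        ("d", 4.0, 44.0),  # same latency as "c" but lower AP: dominated
        ("e", 8.0, 50.0),
    ]
    print(pareto_front(candidates))
    print(best_under_budget(candidates, max_latency_ms=5.0))
```

Weight sharing matters because it lets all candidates be scored without training each one from scratch; the filter above only covers the final selection step.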


📅 Published on Nov 12, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2511.09554
• PDF: https://arxiv.org/pdf/2511.09554
• Project Page: https://rfdetr.roboflow.com/1.3.0/
• GitHub: https://github.com/roboflow/rf-detr 6.9k

🤖 Models citing this paper:
https://huggingface.co/stevenbucaille/rf-detr-small
https://huggingface.co/stevenbucaille/rf-detr-nano
https://huggingface.co/stevenbucaille/rf-detr-base

🚀 Spaces citing this paper:
https://huggingface.co/spaces/arihant3704/rf-detr-playground

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#RealTimeObjectDetection #NeuralArchitectureSearch #DetectionTransformers #WeightSharingNAS #EfficientComputerVision
🔥 Agent READMEs: An Empirical Study of Context Files for Agentic Coding

💡 This paper presents the first large-scale empirical study of agent context files, also known as agent READMEs, which provide persistent, project-level instructions for agentic coding tools. The study analyzed 2,303 agent context files from 1,925 repositories to characterize their structure, maintenance, and content. The researchers found that these files are not static documentation but complex, difficult-to-read artifacts that evolve like configuration code and are maintained through frequent small additions.

The content analysis of 16 instruction types revealed that developers prioritize functional context, such as build and run commands, implementation details, and architecture. However, the study also identified a significant gap, as non-functional requirements like security and performance are rarely specified. The findings indicate that while developers use context files to make agents functional, they provide few guardrails to ensure that agent-written code is secure or performant.

The study highlights the need for improved tooling and practices to close this gap, so that agentic coding tools produce code that is secure and performant as well as functional. Its contributions are a detailed characterization of the structure and content of agent context files and the identification of areas where they fall short.


📅 Published on Nov 17, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2511.12884
• PDF: https://arxiv.org/pdf/2511.12884
• Project Page: https://agents.md
• GitHub: https://github.com/openai/agents.md 21.0k

📊 Datasets citing this paper:
https://huggingface.co/datasets/hao-li/AIDev
https://huggingface.co/datasets/farida5gaber/AIDev
https://huggingface.co/datasets/dysavepeople/AIDev

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticCodingTools #AgentContextFiles #EmpiricalSoftwareEngineering #AgenticREADMEs #SoftwareDevelopmentArtifacts
🔥 OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

💡 The paper introduces OmniFlatten, a novel end-to-end GPT-based model for real-time, natural full-duplex spoken dialogue. Achieving low latency and natural interaction in full-duplex systems is difficult because human conversation involves interruptions, backchannels, and overlapping speech.

To address this, the authors propose a multi-stage post-training technique that integrates speech and text without altering the original model's architecture. Training proceeds in three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. A flattening operation standardizes the data, allowing a unified training method and model architecture across modalities and tasks.

OmniFlatten can generate text and speech in real time, modeling the complex behaviors inherent to natural conversation. The results are demonstrated through audio samples of generated dialogues, which are available online. The authors present the approach as a straightforward modeling technique and a promising direction for efficient, natural end-to-end full-duplex spoken dialogue systems.
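One way to picture a flattening operation is as chunk-wise interleaving of several parallel token streams into a single sequence that an ordinary GPT-style model can consume. The sketch below is an assumption about the general idea, not the paper's exact scheme: the stream names, chunk sizes, and token labels are all hypothetical.

```python
def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def flatten(streams, chunk_sizes):
    """Interleave chunks from parallel streams into one flat sequence.

    In each round, one chunk is taken from every stream in order; streams
    that have run out of chunks are skipped.
    """
    chunked = [chunk(s, n) for s, n in zip(streams, chunk_sizes)]
    rounds = max((len(c) for c in chunked), default=0)
    flat = []
    for i in range(rounds):
        for c in chunked:
            if i < len(c):
                flat.extend(c[i])
    return flat


if __name__ == "__main__":
    # HYPOTHETICAL streams: user speech, assistant text, assistant speech.
    user_speech = ["u1", "u2", "u3", "u4"]
    assistant_text = ["t1", "t2"]
    assistant_speech = ["a1", "a2", "a3", "a4"]
    print(flatten([user_speech, assistant_text, assistant_speech], [2, 1, 2]))
```

Because the output is a single token sequence, the same next-token training objective can be used whether the data is half-duplex or full-duplex.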


📅 Published on Oct 23, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2410.17799
• PDF: https://arxiv.org/pdf/2410.17799
• GitHub: https://github.com/karpathy/nanogpt 57.6k

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#GPTModelArchitecture #FullDuplexDialogueSystems #NaturalLanguageProcessing #SpeechRecognitionTechniques #EndToEndConversationalAI
🔥 DeepSeek-V3 Technical Report

💡 DeepSeek-V3 is a language model that achieves strong performance with efficient training and comparatively low computational cost. It uses a Mixture-of-Experts architecture with 671 billion total parameters, of which only 37 billion are activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts the Multi-head Latent Attention and DeepSeekMoE architectures, which were validated in its predecessor, DeepSeek-V2.

The model is trained on 14.8 trillion diverse and high-quality tokens, followed by supervised fine-tuning and reinforcement learning stages to fully harness its capabilities. The training process is stable and requires only 2.788 million H800 GPU hours for full training, which is relatively low compared to other models.
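The figures above are easier to compare after a little arithmetic. The short sketch below computes the share of parameters active per token and a rough training bill; the price per GPU hour is an assumption introduced here for illustration and is not stated in the post.

```python
# Figures quoted in the summary above.
TOTAL_PARAMS_B = 671   # total parameters, in billions
ACTIVE_PARAMS_B = 37   # parameters activated per token, in billions
GPU_HOURS_M = 2.788    # H800 GPU hours for full training, in millions

# ASSUMPTION: rental price per H800 GPU hour in USD (not stated in the post).
ASSUMED_USD_PER_GPU_HOUR = 2.0


def active_fraction(active_b, total_b):
    """Share of the model's parameters used for any single token."""
    return active_b / total_b


def training_cost_usd(gpu_hours_m, usd_per_hour):
    """Total training cost given GPU hours (in millions) and an hourly price."""
    return gpu_hours_m * 1e6 * usd_per_hour


if __name__ == "__main__":
    frac = active_fraction(ACTIVE_PARAMS_B, TOTAL_PARAMS_B)
    cost = training_cost_usd(GPU_HOURS_M, ASSUMED_USD_PER_GPU_HOUR)
    print(f"active per token: {frac:.1%}")
    print(f"estimated training cost: ${cost:,.0f}")
```

Under these numbers only about one parameter in eighteen participates in each forward pass, which is where the inference savings of a Mixture-of-Experts design come from.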

The results show that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. The model also pioneers an auxiliary-loss-free strategy for load balancing and uses a multi-token prediction training objective for stronger performance. The model checkpoints are available for further research and development.

Overall, DeepSeek-V3 contributes a highly efficient language model whose stable training process and comparatively low training cost make it an attractive option for researchers and developers who want high-performance language models without the usual expense.


📅 Published on Dec 27, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2412.19437
• PDF: https://arxiv.org/pdf/2412.19437
• GitHub: https://github.com/deepseek-ai/deepseek-v3 103.4k

🤖 Models citing this paper:
https://huggingface.co/deepseek-ai/DeepSeek-V3
https://huggingface.co/deepseek-ai/DeepSeek-V3-0324
https://huggingface.co/deepseek-ai/DeepSeek-V3-Base

📊 Datasets citing this paper:
https://huggingface.co/datasets/alpha-one-index/awesome-ai-index
https://huggingface.co/datasets/jeffliulab/visinject
https://huggingface.co/datasets/AcroYAMALEX/acro-yamalex-llmjp-4-math-cot

🚀 Spaces citing this paper:
https://huggingface.co/spaces/nanotron/ultrascale-playbook
https://huggingface.co/spaces/Ki-Seki/ultrascale-playbook-zh-cn
https://huggingface.co/spaces/weege007/ultrascale-playbook

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MixtureOfExpertsArchitecture #DeepLearningModels #ParameterEfficientTraining #LatentAttentionMechanisms #EfficientLanguageModeling
🔥 LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

💡 The paper introduces LlamaFactory, a unified framework for efficient fine-tuning of large language models across various tasks. Fine-tuning these models normally takes significant effort and coding expertise, which is a barrier for many users.

LlamaFactory integrates a suite of cutting-edge efficient training methods and lets users customize the fine-tuning of over 100 language models without writing code, through a web-based user interface called LlamaBoard.

The authors validate the efficiency and effectiveness of the framework on language modeling and text generation tasks. The framework has been released publicly and, at the time the paper was written, had already gathered over 13,000 stars and 1,600 forks on GitHub.


📅 Published on Mar 20, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2403.13372
• PDF: https://arxiv.org/pdf/2403.13372
• Project Page: https://huggingface.co/spaces/hiyouga/LLaMA-Board
• GitHub: https://github.com/hiyouga/LLaMA-Factory 70.9k

🤖 Models citing this paper:
https://huggingface.co/AELLM/Llama-3.2-Chibi-3B
https://huggingface.co/GXMZU/Qwen3-14B-ai-expert
https://huggingface.co/Xin-Rui/LLAMA-Fac-NEW-A800

🚀 Spaces citing this paper:
https://huggingface.co/spaces/hiyouga/LLaMA-Board
https://huggingface.co/spaces/Justinrune/LLaMA-Factory
https://huggingface.co/spaces/Darok/Featherless-Feud

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#EfficientFineTuning #LanguageModelOptimization #UnifiedTrainingFrameworks #LargeLanguageModelDevelopment #AutomatedModelCustomization
🔥 LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

💡 The paper introduces LeWorldModel, a stable end-to-end joint-embedding predictive architecture that trains efficiently from raw pixels. Existing methods for learning world models in compact latent spaces are fragile and rely on complex loss terms, pre-trained encoders, or auxiliary supervision to avoid representation collapse.

LeWorldModel instead trains end to end from raw pixels with only two loss terms: a next-embedding prediction loss and a regularizer. This reduces the number of tunable loss hyperparameters from six to one compared with existing methods. The model has approximately 15 million parameters and can be trained on a single GPU in a few hours, making it up to 48 times faster than foundation-model-based world models.

The results show that LeWorldModel remains competitive across diverse 2D and 3D control tasks and encodes meaningful physical structure in its latent space. It also reliably detects physically implausible events, which suggests that it learns a robust and generalizable representation of the world.
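A two-term objective with a single weighting hyperparameter can be sketched as follows. The post does not specify the regularizer, so the moment-matching penalty used here is a simple stand-in chosen for illustration, not the paper's regularizer; all function names are hypothetical.

```python
def mse(pred, target):
    """Mean squared error between predicted and actual next embeddings."""
    n = sum(len(p) for p in pred)
    return sum(
        (a - b) ** 2 for p, t in zip(pred, target) for a, b in zip(p, t)
    ) / n


def moment_regularizer(embeddings):
    """STAND-IN anti-collapse term (an assumption, not the paper's regularizer).

    Penalizes each embedding dimension for having a batch mean away from 0
    or a batch variance away from 1, so constant embeddings are discouraged.
    """
    batch = len(embeddings)
    dim = len(embeddings[0])
    penalty = 0.0
    for d in range(dim):
        col = [e[d] for e in embeddings]
        mean = sum(col) / batch
        var = sum((x - mean) ** 2 for x in col) / batch
        penalty += mean ** 2 + (var - 1.0) ** 2
    return penalty / dim


def total_loss(pred_next, true_next, embeddings, lam):
    """Prediction loss plus one weighted regularizer: `lam` is the only knob."""
    return mse(pred_next, true_next) + lam * moment_regularizer(embeddings)


if __name__ == "__main__":
    healthy = [[1.0, -1.0], [-1.0, 1.0]]   # zero mean, unit variance
    collapsed = [[0.5, 0.5], [0.5, 0.5]]   # every sample identical
    print(moment_regularizer(healthy), moment_regularizer(collapsed))
```

The prediction loss alone can be driven to zero by mapping every frame to the same embedding; the second term exists to make that degenerate solution expensive.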


📅 Published on Mar 13, 2026

🔗 Links:
• arXiv: https://arxiv.org/abs/2603.19312
• PDF: https://arxiv.org/pdf/2603.19312
• Project Page: https://le-wm.github.io/
• GitHub: https://github.com/lucas-maes/le-wm 3.1k

🤖 Models citing this paper:
https://huggingface.co/quentinll/lewm-pusht
https://huggingface.co/aguennoune17/atlas-v2-nwm-fp8-compressed
https://huggingface.co/quentinll/lewm-tworooms

📊 Datasets citing this paper:
https://huggingface.co/datasets/quentinll/lewm-pusht
https://huggingface.co/datasets/quentinll/lewm-cube
https://huggingface.co/datasets/quentinll/lewm-reacher

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#WorldModels #JointEmbedding #PredictiveArchitectures #EndToEndLearning #LatentSpaceRepresentation