AI & ML Papers

A decoder-only foundation model for time-series forecasting

Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on...


🔥 Efficient Memory Management for Large Language Model Serving with PagedAttention

💡 The paper addresses efficient memory management for large language model serving, which is crucial for high throughput. Existing systems struggle to manage key-value (KV) cache memory, which is large and grows and shrinks dynamically; fragmentation and redundant duplication waste a significant share of it. To solve this problem, the authors propose PagedAttention, an attention algorithm inspired by classical virtual memory and paging techniques in operating systems. On top of it they build vLLM, a large language model serving system that achieves near-zero waste in KV cache memory and flexible sharing of the cache within and across requests. In the authors' evaluation, vLLM improves the throughput of popular large language models by 2-4 times at the same level of latency compared to state-of-the-art systems. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms.
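
The paging idea can be illustrated with a toy block allocator. This is only a sketch under stated assumptions, not vLLM's implementation: the class `PagedKVCache`, its method names, and the block size of 4 tokens are all hypothetical. It shows how a per-sequence block table lets memory be reserved one fixed-size block at a time, so unused space stays below one block per sequence.

```python
class PagedKVCache:
    """Toy allocator (hypothetical): maps each sequence's logical KV blocks
    to physical block ids, allocating on demand."""

    def __init__(self, num_blocks: int, block_size: int = 4):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))  # pool of physical block ids
        self.block_tables = {}  # seq_id -> list of physical block ids
        self.lengths = {}       # seq_id -> number of tokens stored

    def append_token(self, seq_id: str) -> None:
        """Reserve room for one more token; grab a new block only when needed."""
        table = self.block_tables.setdefault(seq_id, [])
        length = self.lengths.get(seq_id, 0)
        if length == len(table) * self.block_size:  # no block yet, or last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            table.append(self.free_blocks.pop())
        self.lengths[seq_id] = length + 1

    def free(self, seq_id: str) -> None:
        """Return all blocks of a finished sequence to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self) -> int:
        """Reserved but unused token slots across all sequences."""
        return sum(
            len(table) * self.block_size - self.lengths.get(seq_id, 0)
            for seq_id, table in self.block_tables.items()
        )


cache = PagedKVCache(num_blocks=8, block_size=4)
for _ in range(5):
    cache.append_token("a")  # 5 tokens need 2 blocks
for _ in range(3):
    cache.append_token("b")  # 3 tokens need 1 block
```

In this toy setup each sequence wastes at most `block_size - 1` slots, in contrast to reserving a contiguous region sized for the maximum possible length.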

📅 Published on Sep 12, 2023

🔗 Links:
• arXiv: https://arxiv.org/abs/2309.06180
• PDF: https://arxiv.org/pdf/2309.06180
• GitHub: https://github.com/vllm-project/vllm ⭐ 79.0k

🤖 Models citing this paper:
• https://huggingface.co/theonlyengine/Flash-attention1
• https://huggingface.co/enfinity7B/apac

📊 Datasets citing this paper:
• https://huggingface.co/datasets/TheBlueScrubs/TheBlueScrubs-v1

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/Vrushali777/vllm-inference-benchmark

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LargeLanguageModels #EfficientMemoryManagement #PagedAttention #LanguageModelServing #KeyValueCacheOptimization

Efficient Memory Management for Large Language Model Serving with...

High throughput serving of large language models (LLMs) requires batching sufficiently many requests at a time. However, existing systems struggle because the key-value cache (KV cache) memory for...


🔥 RAG-Anything: All-in-One RAG Framework

💡 The paper introduces RAG-Anything, a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching. The problem addressed is that current Retrieval-Augmented Generation frameworks are limited to textual content, creating gaps when processing multimodal documents that contain a combination of text, images, tables, and mathematical expressions.

The proposed method, RAG-Anything, reconceptualizes multimodal content as interconnected knowledge entities, introducing dual-graph construction to capture both cross-modal relationships and textual semantics within a unified representation. The framework develops cross-modal hybrid retrieval that combines structural knowledge navigation with semantic matching, enabling effective reasoning over heterogeneous content where relevant evidence spans multiple modalities.
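
A small sketch may make the hybrid retrieval idea concrete. Everything here is an assumption for illustration: the entity names, the 3-dimensional vectors, the edge list, and the function `hybrid_retrieve` are hypothetical and are not taken from the RAG-Anything codebase. Semantic matching picks seed entities by cosine similarity, and structural navigation then follows graph edges to pull in linked content from other modalities.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical knowledge entities extracted from one multimodal document.
entities = {
    "para_1": {"modality": "text", "vec": [1.0, 0.0, 0.0]},
    "table_2": {"modality": "table", "vec": [0.0, 1.0, 0.0]},
    "fig_3": {"modality": "image", "vec": [0.0, 0.0, 1.0]},
    "eq_4": {"modality": "equation", "vec": [0.6, 0.8, 0.0]},
}

# Hypothetical cross-modal links (e.g. a paragraph that refers to a figure).
edges = {
    "para_1": {"fig_3"},
    "fig_3": {"para_1"},
    "table_2": {"eq_4"},
    "eq_4": {"table_2"},
}


def hybrid_retrieve(query_vec, top_k=1, hops=1):
    """Semantic matching selects seeds; graph navigation adds their neighbours."""
    ranked = sorted(
        entities,
        key=lambda name: cosine(query_vec, entities[name]["vec"]),
        reverse=True,
    )
    result = ranked[:top_k]
    frontier = set(result)
    for _ in range(hops):
        frontier = {n for e in frontier for n in edges.get(e, ())} - set(result)
        result.extend(sorted(frontier))
    return result


hits = hybrid_retrieve([0.9, 0.1, 0.0])
```

Here the figure is returned even though its embedding has zero similarity to the query, because it is linked to the best-matching paragraph.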

The results show that RAG-Anything demonstrates superior performance on challenging multimodal benchmarks, achieving significant improvements over state-of-the-art methods. The performance gains are particularly pronounced on long documents where traditional approaches fail. The framework establishes a new paradigm for multimodal knowledge access, eliminating the architectural fragmentation that constrains current systems. The RAG-Anything framework is open-sourced, making it available for further development and application. Overall, the paper contributes to the development of a more comprehensive and effective knowledge retrieval system that can handle multimodal content.

📅 Published on Oct 14, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2510.12323
• PDF: https://arxiv.org/pdf/2510.12323
• GitHub: https://github.com/HKUDS/RAG-Anything ⭐ 19.6k


#MultimodalKnowledgeRetrieval #CrossModalRelationships #RetrievalAugmentedGeneration #MultimodalDocumentProcessing #SemanticMatching

RAG-Anything: All-in-One RAG Framework

Retrieval-Augmented Generation (RAG) has emerged as a fundamental paradigm for expanding Large Language Models beyond their static training limitations. However, a critical misalignment exists...


🔥 Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

💡 The paper introduces Mem0, a memory-centric architecture designed to improve long-term conversational coherence in large language models. The problem addressed is that existing models have fixed context windows, making it challenging to maintain consistency over prolonged multi-session dialogues. To solve this, Mem0 dynamically extracts, consolidates, and retrieves salient information from ongoing conversations. An enhanced variant adds a graph-based memory representation to capture complex relational structures among conversational elements.
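
The consolidate-and-retrieve cycle can be sketched with a toy store. This is a sketch under assumptions, not the Mem0 API: the class `MemoryStore`, the `(subject, attribute)` keying, the operation labels, and the keyword-based retrieval are hypothetical stand-ins, and the extraction step (which would come from a language model) is replaced by hand-written facts.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryStore:
    """Toy long-term memory (hypothetical): one value per (subject, attribute)."""

    facts: dict = field(default_factory=dict)

    def consolidate(self, subject: str, attribute: str, value: str) -> str:
        """Merge a newly extracted fact and report what happened to the store."""
        key = (subject, attribute)
        if key not in self.facts:
            self.facts[key] = value
            return "ADD"
        if self.facts[key] == value:
            return "NOOP"  # already known, nothing to store
        self.facts[key] = value  # newer information replaces the old value
        return "UPDATE"

    def retrieve(self, query: str) -> list:
        """Return only the facts whose subject or attribute appears in the query."""
        words = set(query.lower().split())
        return [
            f"{subject} {attribute}: {value}"
            for (subject, attribute), value in self.facts.items()
            if words & {subject.lower(), attribute.lower()}
        ]


store = MemoryStore()
ops = [
    store.consolidate("Alice", "city", "Paris"),
    store.consolidate("Alice", "city", "Paris"),
    store.consolidate("Alice", "city", "Berlin"),
    store.consolidate("Alice", "pet", "cat"),
]
answer_context = store.retrieve("which city")
```

Only the few retrieved facts would be passed to the model, rather than the full conversation history, which is the intuition behind the smaller prompts this kind of design allows.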

The authors evaluated Mem0 against six baseline categories, including established memory-augmented systems, retrieval-augmented generation, and a full-context approach. The results show that Mem0 consistently outperforms all existing memory systems across four question categories, achieving a 26 percent relative improvement in the LLM-as-a-Judge metric over the OpenAI baseline. Additionally, Mem0 with graph memory achieves a higher overall score than the base configuration.

The method not only improves accuracy but also reduces computational overhead, with a 91 percent lower p95 latency and more than 90 percent token cost savings compared to the full-context method. The findings highlight the critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient large language model-driven AI agents. Overall, Mem0 offers a compelling balance between advanced reasoning capabilities and practical deployment constraints.

📅 Published on Apr 28, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2504.19413
• PDF: https://arxiv.org/pdf/2504.19413
• Project Page: https://mem0.ai/research
• GitHub: https://github.com/mem0ai/mem0 ⭐ 54.8k

📊 Datasets citing this paper:
• https://huggingface.co/datasets/GloriaaaM/LLM-Agent-Harness-Survey

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/Pratham13/research-paper-finder


#LongTermMemoryInAI #ConversationalCoherence #MemoryCentricArchitecture #GraphBasedMemoryRepresentation #ScalableLanguageModels

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Large Language Models (LLMs) have demonstrated remarkable prowess in generating contextually coherent responses, yet their fixed context windows pose fundamental challenges for maintaining...


🔥 AutoDev: Automated AI-Driven Development

💡 The paper introduces AutoDev, an automated AI-driven software development framework that automates complex engineering tasks within a secure environment. The problem addressed is that existing AI-powered assistants for software development, such as GitHub Copilot, have limited capabilities, primarily focusing on suggesting code snippets and file manipulation within a chat-based interface. They do not leverage the full potential of an integrated development environment, which includes building, testing, executing code, and git operations.

To fill this gap, AutoDev is designed for autonomous planning and execution of intricate software engineering tasks. It enables users to define complex software engineering objectives, which are assigned to autonomous AI agents to achieve. These AI agents can perform diverse operations on a codebase, including file editing, retrieval, build processes, execution, testing, and git operations. They also have access to files, compiler output, build and testing logs, static analysis tools, and more, allowing them to execute tasks in a fully automated manner with a comprehensive understanding of the contextual information required.

AutoDev establishes a secure development environment by confining all operations within Docker containers, incorporating guardrails to ensure user privacy and file security. Users can define specific permitted or restricted commands and operations within AutoDev, allowing for a high degree of control over the development process.
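
The permitted/restricted command idea can be sketched as a simple check applied before anything runs in the container. This is an illustrative assumption, not AutoDev's actual configuration format or code: the function `is_allowed` and the two example sets are hypothetical.

```python
import shlex


def is_allowed(command: str, permitted: set, restricted: set) -> bool:
    """Return True only if the command's program is permitted and not restricted."""
    tokens = shlex.split(command)
    if not tokens:
        return False  # nothing to run
    program = tokens[0]
    if program in restricted:
        return False  # the deny list always wins
    return program in permitted  # anything not explicitly permitted is rejected


# Hypothetical user-defined rules.
PERMITTED = {"git", "pytest", "ls"}
RESTRICTED = {"rm", "curl"}

checks = {
    cmd: is_allowed(cmd, PERMITTED, RESTRICTED)
    for cmd in ["git status", "pytest -q tests", "rm -rf build", "python setup.py"]
}
```

Defaulting to rejection for unknown programs is one possible design choice; a real system would also need to inspect arguments, pipes, and subshells.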

The evaluation of AutoDev was conducted using the HumanEval dataset, with promising results. The framework achieved 91.5% and 87.8% Pass@1 for code generation and test generation respectively, demonstrating its effectiveness in automating software engineering tasks while maintaining a secure and user-controlled development environment. Overall, AutoDev contributes to the field of software development by providing a fully automated AI-driven framework that can perform a wide range of tasks, from code generation to testing and deployment, within a secure and controlled environment.
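
As a worked example of the metric: with a single attempt per problem, Pass@1 is simply the fraction of problems whose generated solution passes. The sketch below assumes one attempt per problem and the 164 problems of HumanEval; the solved counts of 150 and 144 are inferred because they round to the reported percentages, and are not stated in the source.

```python
def pass_at_1(results):
    """Fraction of problems solved, given one boolean outcome per problem."""
    return sum(results) / len(results)


NUM_PROBLEMS = 164  # size of the HumanEval benchmark

# Inferred counts (assumption): chosen so the rates round to the reported values.
code_generation = [True] * 150 + [False] * (NUM_PROBLEMS - 150)
test_generation = [True] * 144 + [False] * (NUM_PROBLEMS - 144)

code_score = round(100 * pass_at_1(code_generation), 1)
test_score = round(100 * pass_at_1(test_generation), 1)
```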

📅 Published on Mar 13, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2403.08299
• PDF: https://arxiv.org/pdf/2403.08299
• GitHub: https://github.com/vxcontrol/pentagi ⭐ 16.4k


#AIDrivenDevelopment #AutomatedSoftwareEngineering #AIIntegratedDevelopment #AutoDevFramework #AIpoweredCodingAssistants

AutoDev: Automated AI-Driven Development

The landscape of software development has witnessed a paradigm shift with the advent of AI-powered assistants, exemplified by GitHub Copilot. However, existing solutions are not leveraging all the...


🔥 SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

💡 The paper introduces SmolDocling, a compact vision-language model designed for end-to-end document conversion. The model aims to process entire pages and generate a new universal markup format called DocTags, which captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models or ensemble solutions, SmolDocling offers a single end-to-end conversion model with 256M parameters. This approach allows for accurately capturing content, structure, and spatial location of document elements.

The model is trained to reproduce document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types, including business documents, academic papers, technical reports, patents, and forms. The authors also contribute novel publicly sourced datasets for charts, tables, equations, and code recognition.

Experimental results demonstrate that SmolDocling performs competitively with other vision language models that are up to 27 times larger in size, while reducing computational requirements substantially. The model's compact size and robust performance make it a significant contribution to the field of document conversion. The authors plan to make the model and datasets publicly available, which will facilitate further research and development in this area. Overall, SmolDocling offers an efficient and effective solution for end-to-end document conversion, with potential applications in various industries and domains.

SmolDocling: An ultra-compact vision-language model for end-to-end...

We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal...


🔥 OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

💡 The paper introduces OpenDevin, a platform for developing artificial intelligence agents that can interact with the world in ways similar to human developers. The platform allows AI agents to write code, use command lines, and browse the web, with support for multiple agents and evaluation benchmarks. The goal of OpenDevin is to provide a flexible and powerful platform for AI agents to interact with their environment, similar to how human developers use software to interact with the world.

The platform supports the implementation of new agents, safe interaction with sandboxed environments for code execution, and coordination between multiple agents. It also incorporates evaluation benchmarks to assess agent performance. OpenDevin is a community effort with over 160 contributors and is released under the permissive MIT license.

The results of the paper include the evaluation of agents over 15 challenging tasks, including software engineering and web browsing. The evaluation benchmarks demonstrate the flexibility and power of the OpenDevin platform in supporting AI agents that can interact with the world in complex ways. The paper also highlights the potential of OpenDevin to improve over time, with ongoing contributions from the community. Overall, the paper contributes to the development of AI agents that can interact with the world in ways similar to human developers, with potential applications in a wide range of areas, including software engineering and web development.

📅 Published on Jul 23, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2407.16741
• PDF: https://arxiv.org/pdf/2407.16741
• GitHub: https://github.com/opendevin/opendevin ⭐ 72.6k

📊 Datasets citing this paper:
• https://huggingface.co/datasets/GloriaaaM/LLM-Agent-Harness-Survey
• https://huggingface.co/datasets/namanvats/harbor-goose-openhands-benchmark
• https://huggingface.co/datasets/antieval/aware-bench-trajectories


#ArtificialIntelligenceAgents #AIPlatformDevelopment #GeneralistAgents #HumanCentricAI #AIEnvironmentInteraction

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to...


🔥 PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

💡 The paper proposes PaddleOCR-VL, a state-of-the-art, resource-efficient model for document parsing. The problem addressed is the need to accurately recognize document elements such as text, tables, formulas, and charts while keeping resource consumption low. To solve this, the authors propose PaddleOCR-VL-0.9B, a compact vision-language model that combines a NaViT-style dynamic-resolution visual encoder with the ERNIE language model. It supports 109 languages and recognizes complex elements with high accuracy. The results show that PaddleOCR-VL achieves state-of-the-art performance in both page-level document parsing and element-level recognition, outperforming existing solutions and remaining strongly competitive with top-tier vision-language models. The model also delivers fast inference, making it suitable for practical deployment in real-world scenarios. The code is available for further research and development.
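
To give a feel for what "dynamic resolution" means for a patch-based visual encoder, the sketch below counts how many patches images of different sizes would produce when they are not resized to one fixed shape. The patch size of 14 pixels and the function `num_patches` are assumptions for illustration; the source does not state the encoder's actual settings.

```python
import math


def num_patches(width: int, height: int, patch: int = 14) -> int:
    """Number of patch tokens for an image kept at its own resolution.

    Partial patches at the right and bottom edges are counted as full
    patches (i.e. the image is assumed to be padded up to a multiple).
    """
    return math.ceil(width / patch) * math.ceil(height / patch)


square_page = num_patches(224, 224)   # 16 x 16 patches
wide_table = num_patches(448, 224)    # 32 x 16 patches
small_crop = num_patches(100, 30)     # 8 x 3 patches
```

The token count follows the input size and aspect ratio, so a small formula crop costs far fewer tokens than a full page.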

📅 Published on Oct 16, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2510.14528
• PDF: https://arxiv.org/pdf/2510.14528
• GitHub: https://github.com/PaddlePaddle/PaddleOCR ⭐ 77.1k

🤖 Models citing this paper:
• https://huggingface.co/PaddlePaddle/PaddleOCR-VL
• https://huggingface.co/PaddlePaddle/PP-DocLayoutV2
• https://huggingface.co/unsloth/PaddleOCR-VL

📊 Datasets citing this paper:
• https://huggingface.co/datasets/proxectonos/corpus_dominio_cientifico

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/PaddlePaddle/PaddleOCR-VL_Online_Demo
• https://huggingface.co/spaces/eduagarcia/multilingual-tokenizer-leaderboard
• https://huggingface.co/spaces/waytoAGI/PaddleOCR-VL_Online_Demo


#MultilingualDocumentParsing #VisionLanguageModels #DocumentAnalysis #TableRecognition #MultimodalLearning

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B...

In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model...


🔥 LTX-2: Efficient Joint Audio-Visual Foundation Model

💡 The paper introduces LTX-2, an open-source audiovisual diffusion model that generates synchronized video and audio content. The problem addressed is that current text-to-video diffusion models can generate compelling video sequences but lack the semantic, emotional, and atmospheric cues that audio provides. To solve this, the authors propose a dual-stream transformer architecture with cross-modal attention and classifier-free guidance. The model consists of a 14 billion parameter video stream and a 5 billion parameter audio stream, coupled through bidirectional audio-video cross-attention layers. This architecture enables efficient training and inference of a unified audiovisual model, with more capacity allocated for video generation than audio generation.

The method used to achieve this includes employing a multilingual text encoder for broader prompt understanding and introducing a modality-aware classifier-free guidance mechanism for improved audiovisual alignment and controllability. The model is trained using a combination of video and audio data, with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning.
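
The post does not give the guidance formula, so the sketch below shows only standard classifier-free guidance, applied with a separate guidance scale per stream as one hypothetical reading of "modality-aware". The function `cfg`, the toy prediction vectors, and the scale values are assumptions and should not be taken as LTX-2's actual mechanism.

```python
def cfg(uncond, cond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


# Toy denoiser outputs for each stream (hypothetical values).
video_uncond, video_cond = [0.0, 1.0], [1.0, 3.0]
audio_uncond, audio_cond = [0.0, 1.0], [1.0, 3.0]

# Assumption: each modality gets its own guidance strength.
video_guided = cfg(video_uncond, video_cond, scale=2.0)
audio_guided = cfg(audio_uncond, audio_cond, scale=1.0)
```

With a scale of 1.0 the result equals the conditional prediction; larger scales push further in the direction the condition suggests.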

The results show that LTX-2 achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. The model is capable of generating high-quality, temporally synchronized audiovisual content, including rich and coherent audio tracks that follow the characters, environment, style, and emotion of each scene. The model weights and code are publicly released, making it accessible for further research and development. Overall, LTX-2 provides a significant contribution to the field of audiovisual generation, enabling the creation of more realistic and engaging multimedia content.

📅 Published on Jan 6, 2026

🔗 Links:
• arXiv: https://arxiv.org/abs/2601.03233
• PDF: https://arxiv.org/pdf/2601.03233
• Project Page: https://app.ltx.studio/ltx-2-playground/i2v
• GitHub: https://github.com/Lightricks/LTX-2 ⭐ 6.4k

🤖 Models citing this paper:
• https://huggingface.co/Lightricks/LTX-2
• https://huggingface.co/Lightricks/LTX-2.3
• https://huggingface.co/unsloth/LTX-2.3-GGUF

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/linoyts/LTX-2-3-First-Last-Frame
• https://huggingface.co/spaces/linoyts/LTX-2-3-sync
• https://huggingface.co/spaces/linoyts/LTX-2-3-outpaint


#AudioVisualLearning #MultimodalDiffusionModels #CrossModalAttention #AudioVideoGeneration #JointFoundationModels

LTX-2: Efficient Joint Audio-Visual Foundation Model

Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce...



🔥 LightRAG: Simple and Fast Retrieval-Augmented Generation

💡 The paper introduces LightRAG, a novel approach to improve Retrieval-Augmented Generation systems, which enhance large language models by integrating external knowledge sources. Existing systems have limitations, including reliance on flat data representations and inadequate contextual awareness, leading to fragmented answers that fail to capture complex inter-dependencies. To address these challenges, LightRAG incorporates graph structures into text indexing and retrieval processes, employing a dual-level retrieval system that enhances comprehensive information retrieval from both low-level and high-level knowledge discovery. The integration of graph structures with vector representations facilitates efficient retrieval of related entities and their relationships, significantly improving response times while maintaining contextual relevance. An incremental update algorithm ensures the timely integration of new data, allowing the system to remain effective and responsive in rapidly changing data environments. The experimental results demonstrate considerable improvements in retrieval accuracy and efficiency compared to existing approaches, making LightRAG a significant contribution to the field of Retrieval-Augmented Generation. The authors have made LightRAG open-source, making it available for further development and application. Overall, LightRAG provides a simple and fast retrieval-augmented generation approach that achieves better accuracy and response times, making it a valuable tool for data science applications.
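
The combination of a graph index, two retrieval levels, and incremental updates can be sketched in a few lines. This is a toy under stated assumptions, not LightRAG's API: the class `GraphIndex`, its method names, and the example triples are hypothetical, and plain keyword matching stands in for the vector search a real system would use.

```python
class GraphIndex:
    """Toy entity-relation index (hypothetical) with two lookup levels."""

    def __init__(self):
        self.nodes = set()
        self.edges = {}  # (source, target) -> relation description

    def insert(self, triples):
        """Merge new (source, relation, target) triples into the existing
        graph; nothing already indexed is rebuilt."""
        for source, relation, target in triples:
            self.nodes.update((source, target))
            self.edges[(source, target)] = relation

    def low_level(self, entity):
        """Specific lookup: every relation that touches one entity."""
        return sorted(
            f"{s} {r} {t}" for (s, t), r in self.edges.items() if entity in (s, t)
        )

    def high_level(self, theme):
        """Broader lookup: relations whose description mentions a theme."""
        return sorted(
            f"{s} {r} {t}" for (s, t), r in self.edges.items() if theme in r
        )


index = GraphIndex()
index.insert([("bees", "pollinate", "crops"), ("pesticides", "harm", "bees")])
index.insert([("drought", "harm", "crops")])  # incremental update

about_bees = index.low_level("bees")
about_harm = index.high_level("harm")
```

The second `insert` call only adds one node and one edge, which is the sense in which new data is integrated without reprocessing the earlier documents.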

📅 Published on Oct 8, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2410.05779
• PDF: https://arxiv.org/pdf/2410.05779
• GitHub: https://github.com/hkuds/lightrag ⭐ 34.7k
• Project Page: https://huggingface.co/Neha12210/project2-advanced-rag

🤖 Models citing this paper:
• https://huggingface.co/muthuk1/graphrag-inference-hackathon
• https://huggingface.co/atad-tokyo/GST_LIVING_NOVEL
• https://huggingface.co/Neha12210/project2-advanced-rag

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/rm-lht/lightrag


#RetrievalAugmentedGeneration #GraphBasedInformationRetrieval #KnowledgeDiscoverySystems #LargeLanguageModels #TextIndexingTechniques

LightRAG: Simple and Fast Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to...


🔥 RF-DETR: Neural Architecture Search for Real-Time Detection Transformers

💡 The paper introduces RF-DETR, a lightweight detection transformer that uses neural architecture search to optimize accuracy and latency for real-time object detection. The motivation is that open-vocabulary detectors perform well on COCO but often fail to generalize to real-world datasets with classes not seen during pre-training. Instead of fine-tuning a large vision-language model, the authors fine-tune a pre-trained base network on a target dataset and use weight-sharing neural architecture search to evaluate thousands of network configurations, without re-training each one, to find the best accuracy-latency tradeoff.

The results show that RF-DETR significantly outperforms prior state-of-the-art real-time methods on COCO and Roboflow100-VL. RF-DETR (nano) achieves 48.0 AP on COCO, beating D-FINE (nano) by 5.3 AP at similar latency, and RF-DETR (2x-large) outperforms GroundingDINO (tiny) on Roboflow100-VL while running 20 times as fast. According to the authors, RF-DETR (2x-large) is also the first real-time detector to surpass 60 AP on COCO. The code is available for further research in real-time object detection.
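
Choosing among many evaluated configurations by accuracy-latency tradeoff amounts to keeping the Pareto-optimal ones. The sketch below shows that selection step only; the configuration names and the latency and AP numbers are made up for illustration, and the function `pareto_front` is not from the RF-DETR codebase.

```python
def pareto_front(configs):
    """Keep configurations that no other configuration dominates.

    A configuration is dominated if another one is at least as fast and at
    least as accurate, and strictly better on one of the two.
    """
    front = []
    for name, latency, ap in configs:
        dominated = any(
            other_latency <= latency
            and other_ap >= ap
            and (other_latency < latency or other_ap > ap)
            for _, other_latency, other_ap in configs
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical (name, latency in ms, AP) results for sampled sub-networks.
candidates = [
    ("cfg_a", 2.0, 45.0),
    ("cfg_b", 2.5, 44.0),  # slower and less accurate than cfg_a
    ("cfg_c", 4.0, 52.0),
    ("cfg_d", 4.0, 50.0),  # same latency as cfg_c but less accurate
    ("cfg_e", 9.0, 58.0),
]

best = pareto_front(candidates)
```

A deployment would then pick the point on this front that fits its latency budget.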

📅 Published on Nov 12, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2511.09554
• PDF: https://arxiv.org/pdf/2511.09554
• Project Page: https://rfdetr.roboflow.com/1.3.0/
• GitHub: https://github.com/roboflow/rf-detr ⭐ 6.9k

🤖 Models citing this paper:
• https://huggingface.co/stevenbucaille/rf-detr-small
• https://huggingface.co/stevenbucaille/rf-detr-nano
• https://huggingface.co/stevenbucaille/rf-detr-base

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/arihant3704/rf-detr-playground


#RealTimeObjectDetection #NeuralArchitectureSearch #DetectionTransformers #WeightSharingNAS #EfficientComputerVision

RF-DETR: Neural Architecture Search for Real-Time Detection Transformers

Open-vocabulary detectors achieve impressive performance on COCO, but often fail to generalize to real-world datasets with out-of-distribution classes not typically found in their pre-training....
