AI & ML Papers

🔥 SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

💡 The paper introduces SOAR, a post-training method for diffusion models that addresses the gap between supervised fine-tuning (SFT) and reinforcement learning (RL). Supervised fine-tuning optimizes the denoiser only on ground-truth states. Once inference drifts away from these ideal states, the model has to rely on out-of-distribution generalization rather than learned correction, which leads to exposure bias. Reinforcement learning can address this mismatch, but its terminal reward signal is sparse and suffers from credit-assignment difficulties.

SOAR fills this gap with a bias-correction post-training stage that provides dense, reward-free supervision through self-correction. The method starts from a real sample, performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. The approach is on-policy and reward-free, and it provides dense per-timestep supervision with no credit-assignment problem.
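
The rollout-and-correct loop described above can be sketched on plain Python lists. This is a minimal, hypothetical reading of the one-paragraph description, not the paper's algorithm. It assumes a rectified-flow parameterization (x_t = (1 - t)·x0 + t·eps, with the model predicting the velocity eps - x0), interprets "re-noising" as pushing the model's own clean estimate back to a timestep r with fresh noise, and uses a velocity MSE as the loss. The names soar_example and self_correction_loss are ours.

```python
import random
from typing import Callable, List, Tuple

Vec = List[float]
Model = Callable[[Vec, float], Vec]  # (noisy state, timestep) -> predicted velocity


def soar_example(x0: Vec, model: Model, t: float, r: float,
                 rng: random.Random) -> Tuple[Vec, float, Vec]:
    """Build one self-correction training example (x_r, r, v_target).

    Assumes rectified flow: x_t = (1 - t) * x0 + t * eps, velocity = eps - x0.
    """
    # 1) Noise the real sample onto the ground-truth trajectory at time t.
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [(1 - t) * a + t * e for a, e in zip(x0, eps)]
    # 2) Single rollout with the current model. In a real trainer this call
    #    runs under stop-gradient (torch.no_grad / jax.lax.stop_gradient).
    v = model(x_t, t)
    x0_hat = [xt - t * vi for xt, vi in zip(x_t, v)]  # model's own, imperfect estimate
    # 3) Re-noise the off-trajectory estimate to timestep r with fresh noise.
    eps_new = [rng.gauss(0.0, 1.0) for _ in x0]
    x_r = [(1 - r) * a + r * e for a, e in zip(x0_hat, eps_new)]
    # 4) Target velocity that carries x_r back to the ORIGINAL clean sample:
    #    x_r - r * v_target == x0.
    v_target = [(xr - a) / r for xr, a in zip(x_r, x0)]
    return x_r, r, v_target


def self_correction_loss(model: Model, x_r: Vec, r: float, v_target: Vec) -> float:
    """Mean squared error between the model's velocity and the corrective target."""
    pred = model(x_r, r)
    return sum((p - q) ** 2 for p, q in zip(pred, v_target)) / len(pred)
```

Because the target always points back at the real x0 rather than along the model's own drifted path, every sampled timestep gets a supervision signal and no reward model is needed. In a real trainer, steps 1 to 3 would run without gradients and only the model call inside the loss would be differentiated.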

The results show that SOAR improves diffusion models on several text-to-image evaluations. With SD3.5-Medium as the base model, SOAR raises the GenEval score from 0.70 to 0.78 and the OCR score from 0.64 to 0.67 compared with supervised fine-tuning. SOAR also reaches higher final metric values than Flow-GRPO on both aesthetic and text-image alignment tasks, despite having no access to a reward model. The paper concludes that SOAR can directly replace supervised fine-tuning as a stronger first post-training stage after pretraining while remaining fully compatible with subsequent reinforcement learning alignment.

📅 Published on Apr 14, 2026

🔗 Links:
• arXiv: https://arxiv.org/abs/2604.12617
• PDF: https://arxiv.org/pdf/2604.12617
• Project Page: https://hy-soar.github.io/
• GitHub: https://github.com/Tencent-Hunyuan/HY-SOAR ⭐ 350

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DiffusionModels #SelfCorrectionTechniques #OptimalAlignmentMethods #RefinementInAI #PostTrainingMethods

🔥 TradingAgents: Multi-Agents LLM Financial Trading Framework

💡 The paper introduces TradingAgents, a multi-agent framework that uses large language models for stock trading by simulating the collaborative dynamics of real-world trading firms. Its LLM-powered agents take specialized roles: fundamental, sentiment, and technical analysts, plus traders with different risk profiles. Bull and bear researcher agents assess market conditions, a risk management team monitors exposure, and the traders synthesize insights from the researchers' debates and historical data to make trading decisions.
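
The division of labor can be illustrated with a toy pipeline in which every LLM agent is replaced by a hypothetical rule: analysts score the stock, bull and bear researchers aggregate the evidence, a trader sizes a position, and the risk team caps it. All heuristics, thresholds, function names, and the 30 percent exposure cap are invented for illustration and say nothing about how TradingAgents actually prompts its agents.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Report:
    role: str
    score: float  # -1.0 = strongly bearish ... +1.0 = strongly bullish


def clamp(x: float, lo: float = -1.0, hi: float = 1.0) -> float:
    return max(lo, min(hi, x))


def analyst_team(prices: List[float], sentiment: float, pe_ratio: float) -> List[Report]:
    """Rule-based stand-ins for the LLM analysts (hypothetical heuristics)."""
    momentum = (prices[-1] - prices[0]) / prices[0]
    fundamental = 1.0 if pe_ratio < 15 else (-1.0 if pe_ratio > 30 else 0.0)
    return [
        Report("technical", clamp(10 * momentum)),
        Report("sentiment", clamp(sentiment)),
        Report("fundamental", fundamental),
    ]


def researcher_debate(reports: List[Report]) -> float:
    """Bull side argues from positive evidence, bear side from negative; net is the consensus."""
    bull = sum(r.score for r in reports if r.score > 0)
    bear = -sum(r.score for r in reports if r.score < 0)
    return (bull - bear) / len(reports)


def trader(consensus: float, risk_appetite: float) -> float:
    """Proposed position as a fraction of capital (negative = short)."""
    return consensus * risk_appetite


def risk_team(position: float, max_exposure: float) -> float:
    """Cap the trader's proposal at the allowed exposure."""
    return clamp(position, -max_exposure, max_exposure)


def decide(prices: List[float], sentiment: float, pe_ratio: float,
           risk_appetite: float = 1.0, max_exposure: float = 0.3) -> float:
    reports = analyst_team(prices, sentiment, pe_ratio)
    return risk_team(trader(researcher_debate(reports), risk_appetite), max_exposure)
```

For example, decide([100.0, 105.0], 0.2, 12.0) produces a bullish consensus, and the risk team then trims the position to the 0.3 cap; the point is only that risk control sits after, and can override, the trading decision.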

The authors position the framework as an answer to the limitations of existing single-agent systems and of multi-agent frameworks whose agents gather data independently. By simulating a dynamic, collaborative trading environment, TradingAgents aims to improve trading metrics such as cumulative return and Sharpe ratio.

The experiments show that TradingAgents outperforms baseline models, with significant improvements in cumulative returns, Sharpe ratio, and maximum drawdown. The framework is publicly released. Overall, the work illustrates the potential of multi-agent LLM frameworks in financial trading and contributes toward more sophisticated, collaborative trading systems modeled on real-world trading firms.

📅 Published on Dec 28, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2412.20138
• PDF: https://arxiv.org/pdf/2412.20138
• GitHub: https://github.com/tauricresearch/tradingagents ⭐ 66.0k

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/shanghengdu/LLM-Agent-Optimization-PaperList
• https://huggingface.co/spaces/tahp0604/ai-stock-watchlist
• https://huggingface.co/spaces/Ervin2077/qiu

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultiAgentSystems #LargeLanguageModels #FinancialTrading #ArtificialIntelligenceInFinance #AgentBasedModeling

🔥 VibeVoice Technical Report

💡 The VibeVoice Technical Report presents a model for synthesizing long-form, multi-speaker speech. The problem addressed is how to generate high-quality long-form speech with multiple speakers efficiently. To solve it, the authors employ next-token diffusion, a unified method for modeling continuous data by autoregressively generating latent vectors via diffusion.

The authors introduce a continuous speech tokenizer that substantially improves data compression and computational efficiency: it compresses data 80 times more than the popular Encodec model while maintaining comparable performance. The tokenizer preserves audio fidelity and makes long sequences efficient to process.

VibeVoice can synthesize up to 90 minutes of speech with at most 4 speakers. It captures authentic conversational tone and surpasses both open-source and proprietary dialogue models in performance and fidelity. Overall, the report presents an efficient approach to synthesizing high-quality long-form multi-speaker speech.
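
The 90-minute figure is easiest to read as token arithmetic. The sketch below assumes, without this post stating it, that the tokenizer turns 24 kHz audio into latent frames with a 3200x temporal downsampling (7.5 frames per second) and that the budget is a 64K-token context. It ignores the text tokens that share the context, so it gives an upper bound rather than the model's real limit; the constants and function names are ours.

```python
import math

SAMPLE_RATE_HZ = 24_000     # assumption: 24 kHz input audio
DOWNSAMPLE = 3_200          # assumption: tokenizer's temporal downsampling factor
CONTEXT_TOKENS = 64 * 1024  # assumption: "64K" context window


def latent_frames(minutes: float, sample_rate: int = SAMPLE_RATE_HZ,
                  downsample: int = DOWNSAMPLE) -> int:
    """Number of acoustic latent frames needed for `minutes` of audio."""
    frames_per_second = sample_rate / downsample  # 7.5 under the defaults above
    return math.ceil(minutes * 60 * frames_per_second)


def max_minutes(context: int = CONTEXT_TOKENS, sample_rate: int = SAMPLE_RATE_HZ,
                downsample: int = DOWNSAMPLE) -> float:
    """Longest audio that fits if every context slot held one audio frame."""
    return context / (sample_rate / downsample) / 60
```

Under these assumptions, 90 minutes is 40,500 frames and fits comfortably in the context, while a hypothetical tokenizer at 75 frames per second (downsample=320) would need 405,000 frames for the same audio.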

📅 Published on Aug 26, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2508.19205
• PDF: https://arxiv.org/pdf/2508.19205
• Project Page: https://microsoft.github.io/VibeVoice/
• GitHub: https://github.com/microsoft/VibeVoice ⭐ 46.4k

🤖 Models citing this paper:
• https://huggingface.co/microsoft/VibeVoice-1.5B
• https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B
• https://huggingface.co/aoi-ot/VibeVoice-Large

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/ChaitanyaChandra/VibeVoice
• https://huggingface.co/spaces/lths/VibeVoice-Demo
• https://huggingface.co/spaces/vibingvoice/vibe-voice-custom-voices

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#SpeechSynthesis #MultiSpeakerModeling #DiffusionBasedModeling #ContinuousSpeechTokenization #LatentVectorGeneration

🔥 Kronos: A Foundation Model for the Language of Financial Markets

💡 The paper introduces Kronos, a pre-training framework for financial K-line (candlestick) data that outperforms existing models in forecasting and synthetic data generation. The problem addressed is that current time series foundation models often underperform non-pre-trained architectures on financial candlestick data and overlook important tasks such as volatility prediction and synthetic data generation.

To address this, the authors propose a specialized tokenizer that converts continuous market information into token sequences while preserving both price dynamics and trade-activity patterns. Kronos is then pre-trained with an autoregressive objective on more than 12 billion K-line records from 45 global exchanges, which lets it learn nuanced temporal and cross-asset representations.

In zero-shot evaluation across a range of financial tasks, Kronos improves price-series forecasting RankIC by 93 percent over the leading time series foundation model and by 87 percent over the best non-pre-trained baseline. It also achieves a 9 percent lower mean absolute error in volatility forecasting and a 22 percent improvement in generative fidelity for synthetic K-line sequences. The pre-trained model is publicly available, establishing Kronos as a robust, versatile foundation model for end-to-end financial time series analysis.
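
To make "converting continuous market information into token sequences" concrete, here is a hypothetical tokenizer that swaps in plain uniform binning for the paper's learned tokenizer. Each OHLCV field is z-scored over the window, clipped, and binned, and the five bin indices are packed into one integer id. The bin count, clip range, and function names are illustrative assumptions.

```python
from statistics import mean, pstdev
from typing import List, Sequence

FIELDS = ("open", "high", "low", "close", "volume")  # one K-line record per row


def tokenize_klines(rows: Sequence[Sequence[float]], n_bins: int = 8,
                    clip: float = 3.0) -> List[int]:
    """Map each OHLCV row to one integer token (vocabulary size n_bins ** 5).

    Hypothetical stand-in for a learned tokenizer: z-score each field over the
    window, clip to [-clip, clip], bin uniformly, then pack the bin indices.
    """
    columns = list(zip(*rows))
    # pstdev is 0 for a constant column; fall back to 1.0 to avoid dividing by zero.
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    tokens = []
    for row in rows:
        token = 0
        for i, (value, (mu, sd)) in enumerate(zip(row, stats)):
            z = max(-clip, min(clip, (value - mu) / sd))
            b = min(n_bins - 1, int((z + clip) / (2 * clip) * n_bins))
            token += b * n_bins ** i  # base-n_bins packing, field i at digit i
        tokens.append(token)
    return tokens


def detokenize(token: int, n_bins: int = 8, n_fields: int = len(FIELDS)) -> List[int]:
    """Recover the per-field bin indices from a packed token id."""
    bins = []
    for _ in range(n_fields):
        token, b = divmod(token, n_bins)
        bins.append(b)
    return bins
```

An autoregressive model would then be trained to predict token i+1 from the tokens up to i. Packing all five fields into one id is one simple way to keep price and volume information in the same token; a learned tokenizer can preserve far more detail than fixed bins.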

📅 Published on Aug 2, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2508.02739
• PDF: https://arxiv.org/pdf/2508.02739
• GitHub: https://github.com/shiyu-coder/Kronos ⭐ 22.7k

🤖 Models citing this paper:
• https://huggingface.co/NeoQuasar/Kronos-base
• https://huggingface.co/NeoQuasar/Kronos-Tokenizer-base
• https://huggingface.co/NeoQuasar/Kronos-mini

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/yingfeng64/kronos-api
• https://huggingface.co/spaces/almascp/kronos-eurusd-dashboard
• https://huggingface.co/spaces/superyan/kronos-jp

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#FinancialLanguageModels #KLineDataAnalysis #TimeSeriesForecasting #VolatilityPrediction #FinancialMarketModeling

🔥 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

💡 The paper introduces MinerU2.5, a 1.2-billion-parameter vision-language model for efficient high-resolution document parsing. It achieves state-of-the-art recognition accuracy while staying computationally efficient thanks to a two-stage, coarse-to-fine parsing strategy. In the first stage, the model performs layout analysis on a downsampled image to identify structural elements at low cost. In the second stage, it performs targeted content recognition on native-resolution crops taken from the original image, preserving fine-grained detail in dense text, complex formulas, and tables. To support this strategy, the authors built a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning. MinerU2.5 achieves state-of-the-art results on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks at significantly lower computational overhead. Overall, the paper contributes a decoupled approach to document parsing that balances accuracy and efficiency.
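
The decoupled two-stage flow comes down largely to coordinate bookkeeping: layout is detected on a thumbnail, and each detected box is mapped back to the full-resolution page before cropping. The sketch below shows only that mapping; the layout detector itself is omitted, and Region, max_side, and the function names are hypothetical, not MinerU's API.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    label: str  # e.g. "text", "table", "formula" (from a stage-1 layout pass)
    x0: float
    y0: float
    x1: float
    y1: float


def thumbnail_scale(width: int, height: int, max_side: int) -> float:
    """Scale factor that shrinks the page so its longer side is at most max_side."""
    return min(1.0, max_side / max(width, height))


def native_crop(region: Region, scale: float, width: int,
                height: int) -> Tuple[int, int, int, int]:
    """Map a box found on the thumbnail back to a pixel box on the original page."""
    x0 = max(0, math.floor(region.x0 / scale))
    y0 = max(0, math.floor(region.y0 / scale))
    x1 = min(width, math.ceil(region.x1 / scale))
    y1 = min(height, math.ceil(region.y1 / scale))
    return x0, y0, x1, y1


def plan_recognition(regions: List[Region], width: int, height: int,
                     max_side: int) -> List[Tuple[str, Tuple[int, int, int, int]]]:
    """Stage-2 work list: one native-resolution crop per layout region."""
    scale = thumbnail_scale(width, height, max_side)
    return [(r.label, native_crop(r, scale, width, height)) for r in regions]
```

For a 4000x3000 scan with max_side=1000, a table detected at (100, 200, 300, 260) on the thumbnail becomes the crop (400, 800, 1200, 1040) on the original, so the expensive recognition pass sees full detail only where it is needed.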

📅 Published on Sep 26, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2509.22186
• PDF: https://arxiv.org/pdf/2509.22186
• Project Page: https://opendatalab.github.io/MinerU/
• GitHub: https://github.com/opendatalab/MinerU ⭐ 61.9k

🤖 Models citing this paper:
• https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
• https://huggingface.co/opendatalab/MinerU-Diffusion-V1-0320-2.5B
• https://huggingface.co/freakynit/MinerU2.5-2509-1.2B

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/xiaoye-winters/MinerU-API
• https://huggingface.co/spaces/opendatalab/MinerU-Diffusion-V1-0320-2.5B
• https://huggingface.co/spaces/Instantnewdesign/document_extract

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DocumentParsing #VisionLanguageModel #HighResolutionImageProcessing #LayoutAnalysis #ContentRecognition

🔥 A decoder-only foundation model for time-series forecasting

💡 The paper introduces a decoder-only foundation model for time-series forecasting. Drawing on recent advances in large language models for natural language processing, the authors adapt the same recipe to forecasting. The problem addressed is obtaining accurate forecasts without task-specific training data, a common obstacle in time-series forecasting.

The method pretrains a patched-decoder style attention model on a large time-series corpus. The model is designed to work well across different forecasting history lengths, prediction lengths, and temporal granularities, which makes it usable for a wide variety of forecasting tasks.
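
A patched decoder treats a contiguous chunk of time points as one token. The sketch below shows the input side (left-padding with a mask, then splitting into patches) and how a longer output patch reduces the number of autoregressive steps. The defaults of 32 input points and 128 output points per patch follow our recollection of the paper's configuration and should be treated as placeholders; the function names are ours.

```python
import math
from typing import List, Tuple


def patchify(series: List[float],
             patch_len: int = 32) -> Tuple[List[List[float]], List[List[bool]]]:
    """Left-pad the context to a multiple of patch_len and split it into patches.

    Returns (patches, padding_mask); a mask value of True marks a padded slot.
    """
    pad = (-len(series)) % patch_len
    padded = [0.0] * pad + list(series)
    mask = [True] * pad + [False] * len(series)
    patches = [padded[i:i + patch_len] for i in range(0, len(padded), patch_len)]
    masks = [mask[i:i + patch_len] for i in range(0, len(mask), patch_len)]
    return patches, masks


def decode_steps(horizon: int, output_patch_len: int = 128) -> int:
    """Autoregressive steps needed when each step emits output_patch_len points."""
    return math.ceil(horizon / output_patch_len)
```

A 70-point history becomes three 32-point tokens (the first one mostly padding), and a 300-step horizon needs only three decoding steps instead of 300, which is part of why the same model can serve different history and prediction lengths.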

The results show that the model's out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models trained on each individual dataset. This demonstrates that a single pretrained model can perform well across diverse datasets without task-specific fine-tuning. Overall, the paper presents a promising approach that borrows the decoder-only design of large language models to deliver accurate and flexible forecasts.

📅 Published on Oct 14, 2023

🔗 Links:
• arXiv: https://arxiv.org/abs/2310.10688
• PDF: https://arxiv.org/pdf/2310.10688
• GitHub: https://github.com/google-research/timesfm ⭐ 19.4k

🤖 Models citing this paper:
• https://huggingface.co/google/timesfm-1.0-200m
• https://huggingface.co/google/timesfm-2.0-500m-pytorch
• https://huggingface.co/google/timesfm-2.5-200m-pytorch

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/bahadirtonguc/timesfm-forecaster
• https://huggingface.co/spaces/autogluon/fev-bench
• https://huggingface.co/spaces/JayLacoma/Trader_Technical_Indicators

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#TimeSeriesForecasting #DecoderOnlyModels #FoundationModelsForForecasting #PatchedDecoderAttention #TimeSeriesAnalysis

🔥 Efficient Memory Management for Large Language Model Serving with PagedAttention

💡 The paper addresses efficient memory management for serving large language models (LLMs). High-throughput serving requires batching many requests at once, but existing systems struggle to manage the key-value (KV) cache memory, which is large for each request and grows and shrinks dynamically. Managed inefficiently, this memory is heavily wasted through fragmentation and redundant duplication, which limits the batch size.

To solve this, the authors propose PagedAttention, an attention algorithm inspired by classical virtual memory and paging techniques in operating systems. On top of it they build vLLM, an LLM serving system that achieves near-zero waste in KV cache memory and flexible sharing of the KV cache within and across requests, further reducing memory usage.

Evaluations show that vLLM improves the throughput of popular LLMs by 2-4 times at the same level of latency compared with state-of-the-art systems such as FasterTransformer and Orca. The improvement is more pronounced with longer sequences, larger models, and more complex decoding algorithms. Overall, the paper shows how OS-style paging of the KV cache enables higher-throughput LLM serving.
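
The paging idea can be sketched with a block table per sequence that maps logical KV blocks to physical blocks from a shared pool, plus reference counts so that forked sequences (for example, parallel samples of one prompt) share blocks and copy one only when they would write into a shared block. This is a simplified model under our own assumptions, not vLLM's actual classes: BlockAllocator and SequenceKV are hypothetical names, and the sketch tracks block ids only, without storing key/value tensors.

```python
from typing import Dict, List


class BlockAllocator:
    """Fixed pool of physical KV-cache blocks with reference counts."""

    def __init__(self, num_blocks: int):
        self.free: List[int] = list(range(num_blocks))
        self.refcount: Dict[int, int] = {}

    def allocate(self) -> int:
        if not self.free:
            raise MemoryError("out of KV-cache blocks")
        block = self.free.pop()
        self.refcount[block] = 1
        return block

    def share(self, block: int) -> None:
        self.refcount[block] += 1

    def release(self, block: int) -> None:
        self.refcount[block] -= 1
        if self.refcount[block] == 0:
            del self.refcount[block]
            self.free.append(block)


class SequenceKV:
    """Logical view of one sequence: a block table plus a token count."""

    def __init__(self, allocator: BlockAllocator, block_size: int = 16):
        self.allocator = allocator
        self.block_size = block_size
        self.block_table: List[int] = []  # logical block i -> physical block id
        self.num_tokens = 0

    def append_token(self) -> None:
        slot = self.num_tokens % self.block_size
        if slot == 0:
            # Last block is full (or none exists yet): map a fresh physical block.
            self.block_table.append(self.allocator.allocate())
        else:
            last = self.block_table[-1]
            if self.allocator.refcount[last] > 1:
                # Copy-on-write: the partially filled block is shared with a fork.
                new = self.allocator.allocate()  # a real system copies K/V here
                self.allocator.release(last)
                self.block_table[-1] = new
        self.num_tokens += 1

    def fork(self) -> "SequenceKV":
        """New sequence that shares every existing block with this one."""
        child = SequenceKV(self.allocator, self.block_size)
        child.block_table = list(self.block_table)
        child.num_tokens = self.num_tokens
        for block in self.block_table:
            self.allocator.share(block)
        return child

    def free(self) -> None:
        for block in self.block_table:
            self.allocator.release(block)
        self.block_table = []
        self.num_tokens = 0
```

Waste per sequence is bounded by the unused slots in its last block (at most block_size - 1 tokens), which is the intuition behind the near-zero-waste claim. Full blocks are never written again, so forks can keep sharing them without copying.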

📅 Published on Sep 12, 2023

🔗 Links:
• arXiv: https://arxiv.org/abs/2309.06180
• PDF: https://arxiv.org/pdf/2309.06180
• GitHub: https://github.com/vllm-project/vllm ⭐ 79.0k

🤖 Models citing this paper:
• https://huggingface.co/theonlyengine/Flash-attention1
• https://huggingface.co/enfinity7B/apac

📊 Datasets citing this paper:
• https://huggingface.co/datasets/TheBlueScrubs/TheBlueScrubs-v1

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/Vrushali777/vllm-inference-benchmark

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LargeLanguageModels #EfficientMemoryManagement #PagedAttention #LanguageModelServing #KeyValueCacheOptimization

🔥 RAG-Anything: All-in-One RAG Framework

💡 The paper introduces RAG-Anything, a unified framework that enhances multimodal knowledge retrieval by integrating cross-modal relationships and semantic matching. The problem addressed is that current Retrieval-Augmented Generation frameworks are limited to textual content, creating gaps when processing multimodal documents that contain a combination of text, images, tables, and mathematical expressions.

The proposed method, RAG-Anything, reconceptualizes multimodal content as interconnected knowledge entities and introduces dual-graph construction to capture both cross-modal relationships and textual semantics within a unified representation. It also introduces cross-modal hybrid retrieval, which combines structural knowledge navigation with semantic matching so that the system can reason over heterogeneous content whose relevant evidence spans multiple modalities.
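
A toy version of hybrid retrieval: rank nodes by embedding similarity, then follow graph edges one hop, so that a table or figure linked to a matching paragraph is retrieved even when its own embedding does not match the query. The scoring rule (a fixed hop_weight penalty), the function names, and the node names are illustrative assumptions, not RAG-Anything's API or its actual ranking.

```python
import math
from typing import Dict, List, Sequence, Set, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def hybrid_retrieve(query: Sequence[float],
                    embeddings: Dict[str, Sequence[float]],
                    graph: Dict[str, Set[str]],
                    k: int = 2,
                    hop_weight: float = 0.5) -> List[Tuple[str, float]]:
    """Semantic top-k seeds, then one hop of graph expansion across modalities.

    A neighbor gets hop_weight * (seed score) unless it already scored higher.
    """
    semantic = {node: cosine(query, vec) for node, vec in embeddings.items()}
    seeds = sorted(semantic, key=semantic.get, reverse=True)[:k]
    scores = {node: semantic[node] for node in seeds}
    for seed in seeds:
        for neighbor in graph.get(seed, set()):
            propagated = hop_weight * semantic[seed]
            if propagated > scores.get(neighbor, float("-inf")):
                scores[neighbor] = propagated
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

With a paragraph node linked to a revenue table, a query matching only the paragraph returns both, while pure semantic search (an empty graph) would return the paragraph alone.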

RAG-Anything outperforms state-of-the-art methods on challenging multimodal benchmarks, with particularly large gains on long documents where traditional approaches fail. The authors present it as a new paradigm for multimodal knowledge access that eliminates the architectural fragmentation constraining current systems. The framework is open-sourced. Overall, the paper contributes a more comprehensive and effective knowledge retrieval system for multimodal content.

📅 Published on Oct 14, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2510.12323
• PDF: https://arxiv.org/pdf/2510.12323
• GitHub: https://github.com/HKUDS/RAG-Anything ⭐ 19.6k

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalKnowledgeRetrieval #CrossModalRelationships #RetrievalAugmentedGeneration #MultimodalDocumentProcessing #SemanticMatching

🔥 Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

💡 The paper introduces Mem0, a memory-centric architecture designed to improve long-term conversational coherence in large language models. The problem addressed is that these models have fixed context windows, which makes it hard to stay consistent over prolonged multi-session dialogues. Mem0 addresses this by dynamically extracting, consolidating, and retrieving salient information from ongoing conversations. An enhanced variant additionally uses a graph-based memory representation to capture complex relational structures among conversational elements.
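
The extract, consolidate, retrieve loop can be sketched as a small fact store. We assume each turn yields candidate (key, value) facts from an extractor that is not shown, and that consolidation picks one of four operations: add, update, delete, or no-op. In Mem0 a language model makes that choice; a hypothetical exact-match rule stands in for it here, and MemoryStore, decide, apply, and retrieve are our names.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MemoryStore:
    """Key-value fact memory; keys/values come from a (hypothetical) extractor."""
    facts: Dict[str, str] = field(default_factory=dict)

    def decide(self, key: str, value: Optional[str]) -> str:
        """Rule-based stand-in for the LLM that picks the update operation."""
        if value is None:  # extractor says the fact no longer holds
            return "DELETE" if key in self.facts else "NOOP"
        if key not in self.facts:
            return "ADD"
        return "NOOP" if self.facts[key] == value else "UPDATE"

    def apply(self, key: str, value: Optional[str]) -> str:
        op = self.decide(key, value)
        if op in ("ADD", "UPDATE"):
            self.facts[key] = value  # value is a str on these branches
        elif op == "DELETE":
            del self.facts[key]
        return op

    def retrieve(self, query: str) -> List[Tuple[str, str]]:
        """Return only facts whose key appears in the query (toy relevance test)."""
        words = set(query.lower().split())
        return [(k, v) for k, v in self.facts.items() if k in words]
```

Only the retrieved facts, rather than the whole dialogue history, would be placed in the prompt, which is where savings in tokens and latency over a full-context approach come from.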

The authors evaluated Mem0 against six baseline categories, including established memory-augmented systems, retrieval-augmented generation, and a full-context approach. Across four question categories, Mem0 consistently outperforms all existing memory systems, achieving a 26 percent relative improvement in the LLM-as-a-Judge metric over the OpenAI baseline. The variant with graph memory achieves a higher overall score than the base configuration.

The method not only improves accuracy but also reduces computational overhead, with a 91 percent lower p95 latency and more than 90 percent token cost savings compared to the full-context method. The findings highlight the critical role of structured, persistent memory mechanisms for long-term conversational coherence, paving the way for more reliable and efficient large language model-driven AI agents. Overall, Mem0 offers a compelling balance between advanced reasoning capabilities and practical deployment constraints.

📅 Published on Apr 28, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2504.19413
• PDF: https://arxiv.org/pdf/2504.19413
• Project Page: https://mem0.ai/research
• GitHub: https://github.com/mem0ai/mem0 ⭐ 54.8k

📊 Datasets citing this paper:
• https://huggingface.co/datasets/GloriaaaM/LLM-Agent-Harness-Survey

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/Pratham13/research-paper-finder

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LongTermMemoryInAI #ConversationalCoherence #MemoryCentricArchitecture #GraphBasedMemoryRepresentation #ScalableLanguageModels
