AI & ML Papers

Recursive Language Models

We study allowing large language models (LLMs) to process arbitrarily long prompts through the lens of inference-time scaling. We propose Recursive Language Models (RLMs), a general inference...

420 views, 10:57

🔥 EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning

💡 The paper introduces EverMemOS, a self-organizing memory operating system designed to enhance the long-term interaction capabilities of large language models. The problem addressed is that current large language models have limited context windows, making it difficult to sustain coherent behavior over extended interactions. Existing memory systems store isolated records and retrieve fragments, which limits their ability to consolidate evolving user states and resolve conflicts.

The method proposed by EverMemOS involves an engram-inspired lifecycle for computational memory, which includes three main components: Episodic Trace Formation, Semantic Consolidation, and Reconstructive Recollection. Episodic Trace Formation converts dialogue streams into memory cells that capture episodic traces, atomic facts, and time-bounded foresight signals. Semantic Consolidation organizes these memory cells into thematic scenes, distilling stable semantic structures and updating user profiles. Reconstructive Recollection performs scene-guided agentic retrieval to compose the necessary and sufficient context for downstream reasoning.
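
The three stages above can be pictured with a small data-structure sketch. This is only an illustration of the lifecycle as the summary describes it, not the EverMemOS code: the `MemCell` fields, the `theme` label, and the turn-based expiry are all assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MemCell:
    """One memory cell distilled from a dialogue span (hypothetical schema)."""
    episode: str                             # short episodic trace
    facts: List[str]                         # atomic facts from the span
    theme: str                               # label used to group cells into scenes
    foresight_expires: Optional[int] = None  # turn after which a foresight signal is stale


def consolidate(cells: List[MemCell]) -> Dict[str, List[MemCell]]:
    """Stand-in for Semantic Consolidation: group cells into thematic scenes."""
    scenes: Dict[str, List[MemCell]] = defaultdict(list)
    for cell in cells:
        scenes[cell.theme].append(cell)
    return dict(scenes)


def recollect(scenes: Dict[str, List[MemCell]], theme: str, now: int) -> List[str]:
    """Stand-in for Reconstructive Recollection: collect facts from one scene,
    skipping cells whose time-bounded foresight has expired."""
    facts: List[str] = []
    for cell in scenes.get(theme, []):
        if cell.foresight_expires is not None and cell.foresight_expires < now:
            continue
        facts.extend(cell.facts)
    return facts
```

Grouping by a single theme label is the simplest possible consolidation; the paper's scene-guided agentic retrieval is presumably much richer than a dictionary lookup.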

The results show that EverMemOS achieves state-of-the-art performance on memory-augmented reasoning tasks, as demonstrated by experiments on LoCoMo and LongMemEval. Additionally, a profile study on PersonaMem v2 and qualitative case studies illustrate the chat-oriented capabilities of EverMemOS, such as user profiling and foresight. The code for EverMemOS is available, making it possible for others to build upon and extend this work. Overall, the paper presents a significant contribution to the development of large language models, enabling them to engage in more coherent and effective long-term interactions.

📅 Published on Jan 5

🔗 Links:
• arXiv: https://arxiv.org/abs/2601.02163
• PDF: https://arxiv.org/pdf/2601.02163
• GitHub: https://github.com/EverMind-AI/EverMemOS ⭐ 4.4k

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#SelfOrganizingMemory #LongHorizonReasoning #LargeLanguageModels #MemoryOperatingSystem #StructuredReasoning

391 views, 12:57

🔥 DFlash: Block Diffusion for Flash Speculative Decoding

💡 The paper introduces DFlash, a speculative decoding framework designed to improve the speed of large language models while maintaining their quality. The problem with current large language models is that they require sequential decoding, which leads to high latency and poor GPU utilization. Speculative decoding has been proposed as a solution, where a fast draft model generates outputs that are then verified in parallel by the target model. However, existing speculative decoding methods still rely on sequential drafting, which limits their speedup.

To address this, the authors propose using a lightweight block diffusion model for parallel drafting. The draft model generates draft tokens in a single forward pass and is conditioned on context features extracted from the target model. The result is a framework that enables efficient drafting with high-quality outputs and higher acceptance rates.
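
To make the draft-then-verify loop concrete, here is a minimal sketch of greedy speculative decoding with toy integer tokens. It is not the DFlash implementation: `target_next` and `draft_block` are hypothetical stand-ins, and the toy drafter builds its block with a loop only to produce plausible guesses, whereas the paper's drafter emits the block in one forward pass.

```python
def target_next(prefix):
    """Toy target model: a deterministic next token for a given prefix."""
    return (sum(prefix) + len(prefix)) % 7


def draft_block(prefix, k):
    """Toy drafter: proposes k tokens, with an occasional wrong guess injected."""
    block, seq = [], list(prefix)
    for _ in range(k):
        tok = target_next(seq)
        if len(seq) % 4 == 3:  # simulate an imperfect draft model
            tok = (tok + 1) % 7
        block.append(tok)
        seq.append(tok)
    return block


def verify(prefix, block):
    """Accept the longest draft prefix the target agrees with, then add one
    token from the target itself (a correction or a bonus token)."""
    accepted = []
    for tok in block:
        expected = target_next(prefix + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch
            return accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))  # all accepted: bonus token
    return accepted


def speculative_decode(prompt, n_new, k=4):
    seq = list(prompt)
    while len(seq) - len(prompt) < n_new:
        seq += verify(seq, draft_block(seq, k))
    return seq[:len(prompt) + n_new]


def greedy_decode(prompt, n_new):
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(target_next(seq))
    return seq
```

Because every kept token equals what the target would have produced at that position, `speculative_decode` returns the same sequence as `greedy_decode`; that is the sense in which the speedup is lossless. In a real system the verification of a whole block is a single batched target pass rather than the per-token loop shown here.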

The experiments show that DFlash achieves significant speedup over existing autoregressive methods, with over 6x lossless acceleration across a range of models and tasks. This is up to 2.5x higher speedup than the state-of-the-art speculative decoding method. The method contributes to improving the efficiency of large language models, making them more suitable for practical applications. Overall, DFlash offers a promising solution for speeding up large language models without sacrificing their performance.

📅 Published on Feb 5

🔗 Links:
• arXiv: https://arxiv.org/abs/2602.06036
• PDF: https://arxiv.org/pdf/2602.06036
• Project Page: https://z-lab.ai/projects/dflash/
• GitHub: https://github.com/z-lab/dflash ⭐ 3.1k

🤖 Models citing this paper:
• https://huggingface.co/z-lab/Qwen3.6-27B-DFlash
• https://huggingface.co/z-lab/Qwen3.6-35B-A3B-DFlash
• https://huggingface.co/z-lab/Qwen3.5-27B-DFlash

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/Jackrong/qwen36-eval

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#SpeculativeDecoding #BlockDiffusionModels #LargeLanguageModels #ParallelDecodingTechniques #FlashSpeculativeDecoding

285 views, 04:58

🔥 Adam's Law: Textual Frequency Law on Large Language Models

💡 The paper proposes a novel framework to improve large language model performance through textual frequency analysis. The authors argue that textual frequency, which is the frequency of certain words or phrases in a language, is relevant to human cognition and can also be applied to large language models. However, this topic has been understudied in the context of large language models.

The proposed framework consists of three main components. First, the authors introduce the Textual Frequency Law, which states that frequent textual data should be preferred for large language models, both for prompting and fine-tuning. To estimate the sentence-level frequency, the authors use online resources, as many large language models are closed-source in their training data. They also utilize an input paraphraser to paraphrase the input into a more frequent textual expression.

The second component is Textual Frequency Distillation, which involves querying large language models to conduct story completion by extending sentences in the datasets. The resulting corpora are used to adjust the initial estimation of textual frequency.

The third component is Curriculum Textual Frequency Training, which fine-tunes large language models in increasing order of sentence-level frequency. This means that the models are first trained on the least frequent sentences and then gradually moved to more frequent ones.
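
Reading "increasing order of sentence-level frequency" as ascending (lowest estimated frequency first), the curriculum reduces to a sort over the fine-tuning examples. The sketch below is an assumption-laden illustration: the frequency proxy (mean word count from a reference table) and both function names are hypothetical, and the paper estimates frequency from online resources rather than from a table like this.

```python
from collections import Counter


def sentence_frequency(sentence, word_counts):
    """Hypothetical proxy: mean reference count of the words in the sentence."""
    words = sentence.lower().split()
    return sum(word_counts[w] for w in words) / max(len(words), 1)


def curriculum_order(sentences, word_counts):
    """Order examples by ascending estimated sentence-level frequency."""
    return sorted(sentences, key=lambda s: sentence_frequency(s, word_counts))


# Toy reference counts; unseen words count as 0 because Counter defaults to 0.
counts = Counter({"the": 100, "cat": 10, "sat": 5, "zygote": 1})
ordered = curriculum_order(["the cat", "zygote sat", "the the"], counts)
```

A fine-tuning loop would then iterate over `ordered` in sequence instead of shuffling the data.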

The authors conducted experiments on a curated dataset called Textual Frequency Paired Dataset, which covers tasks such as math reasoning, machine translation, commonsense reasoning, and agentic tool calling. The results show that the proposed framework is effective in improving large language model performance.

Overall, the paper contributes to the understanding of textual frequency in large language models and provides a novel framework for improving their performance. The proposed framework has the potential to be applied to various natural language processing tasks and can lead to more efficient and effective large language models.

📅 Published on Apr 2

🔗 Links:
• arXiv: https://arxiv.org/abs/2604.02176
• PDF: https://arxiv.org/pdf/2604.02176
• GitHub: https://github.com/HongyuanLuke/frequencylaw ⭐ 658

📊 Datasets citing this paper:
• https://huggingface.co/datasets/Akaashiiii/TFPD

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AdamSLaw #TextualFrequencyAnalysis #LargeLanguageModels #NaturalLanguageProcessing #LanguageModelOptimization

521 views, 05:00

🔥 QuantAgent: Price-Driven Multi-Agent LLMs for High-Frequency Trading

💡 The paper introduces QuantAgent, a multi-agent large language model framework designed specifically for high-frequency trading. High-frequency trading requires rapid and precise decisions based on short-term market signals, which is different from traditional financial applications that involve long-term semantic reasoning. Existing large language models are not well-suited for high-frequency trading due to their lack of structured reasoning capabilities and domain-specific tools.

To address this problem, the QuantAgent framework decomposes trading into four specialized agents: Indicator, Pattern, Trend, and Risk. Each agent is equipped with domain-specific tools and structured reasoning capabilities to capture distinct aspects of market dynamics over short temporal windows. The Indicator agent focuses on technical indicators, the Pattern agent focuses on chart patterns, the Trend agent focuses on trend-based features, and the Risk agent focuses on risk management.

The results show that QuantAgent outperforms strong neural and rule-based baselines in terms of predictive accuracy and cumulative return over 4-hour trading intervals. The evaluation was conducted across ten financial instruments, including Bitcoin and Nasdaq futures, using zero-shot evaluations. The findings suggest that combining structured financial priors with language-native reasoning can unlock new potential for real-time decision systems in high-frequency financial markets.

The main contribution of the paper is the introduction of a multi-agent large language model framework that is specifically designed for high-frequency trading. The framework's ability to decompose trading into specialized agents and leverage domain-specific tools and structured reasoning capabilities makes it well-suited for the high-speed and precision-critical demands of high-frequency trading. The results demonstrate the effectiveness of the QuantAgent framework and highlight its potential for use in real-world high-frequency trading applications.

📅 Published on Sep 12, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2509.09995
• PDF: https://arxiv.org/pdf/2509.09995
• Project Page: https://Y-Research-SBU.github.io/QuantAgent/
• GitHub: https://github.com/Y-Research-SBU/QuantAgent ⭐ 2.5k

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#HighFrequencyTrading #MultiAgentSystems #LargeLanguageModels #FinancialMachineLearning #AlgorithmicTrading

847 views, 09:36

🔥 LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

💡 The paper proposes a novel approach to improve the performance of large language models through test-time scaling, which involves allocating additional computation during inference. Existing test-time scaling strategies are typically hand-crafted, relying on manual design and tuning of reasoning patterns and heuristics. This approach leaves much of the computation-allocation space unexplored, resulting in potential inefficiencies.

To address this limitation, the authors introduce AutoTTS, an environment-driven framework that automates the discovery of test-time scaling strategies. Instead of designing individual strategies, researchers can create environments where optimal strategies can be discovered automatically. The key to AutoTTS lies in constructing a discovery environment that provides a tractable control space and frequent, low-cost feedback for strategy search.

The authors formulate test-time scaling as a controller synthesis problem over pre-collected reasoning trajectories and probe signals. In this framework, controllers decide when to branch, continue, probe, prune, or stop, and can be evaluated cheaply without requiring repeated calls to the language model. To make the search tractable, the authors introduce beta parameterization, which enables fine-grained execution trace feedback to improve discovery efficiency.
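
The point of working over pre-collected trajectories is that a candidate controller can be scored by replay, with no new model calls. The following sketch shows that idea in a heavily simplified form: it models only the continue and stop decisions (not branch, probe, or prune), and the controller parameters `min_votes` and `max_branches` are hypothetical, not taken from AutoTTS.

```python
from collections import Counter


def run_controller(answers, min_votes, max_branches):
    """Replay pre-collected sampled answers for one question and stop early
    once any answer reaches min_votes. Assumes answers is non-empty."""
    votes = Counter()
    used = 0
    for ans in answers[:max_branches]:
        votes[ans] += 1
        used += 1
        if votes[ans] >= min_votes:
            break
    prediction = votes.most_common(1)[0][0]
    return prediction, used


def evaluate(dataset, min_votes, max_branches):
    """Score a controller offline: (accuracy, mean branches consumed)."""
    correct = 0
    cost = 0
    for answers, gold in dataset:
        pred, used = run_controller(answers, min_votes, max_branches)
        correct += pred == gold
        cost += used
    return correct / len(dataset), cost / len(dataset)


# Each entry: (answers sampled ahead of time, gold answer).
dataset = [
    (["4", "4", "5", "4"], "4"),
    (["7", "8", "8", "8"], "8"),
]
```

Sweeping `min_votes` over such a dataset traces out an accuracy-cost curve at essentially zero cost, which is the kind of cheap feedback a strategy search needs.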

The proposed approach is evaluated on mathematical reasoning benchmarks, where the discovered strategies demonstrate improved accuracy-cost tradeoffs over strong manually designed baselines. The discovered strategies also generalize to held-out benchmarks and model scales, indicating their robustness and flexibility. Notably, the entire discovery process incurs a relatively low cost of 39.9 dollars and 160 minutes, making it a practical and efficient solution.

Overall, the paper contributes a novel framework for automating test-time scaling strategy discovery, which has the potential to improve the performance of large language models while reducing the need for manual design and tuning. The authors also make their data and code available, facilitating further research and development in this area.

📅 Published on May 8

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.08083
• PDF: https://arxiv.org/pdf/2605.08083
• Project Page: https://zhengkid.github.io/AutoTTS-web/
• GitHub: https://github.com/zhengkid/AutoTTS ⭐ 43

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LargeLanguageModels #TestTimeScaling #AgenticDiscovery #AutomatedReasoning #LanguageModelOptimization

520 views, 21:49

🔥 UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

💡 The paper introduces UniPrefill, a universal prefill acceleration framework designed to improve the inference efficiency of long-context processing in large language models. The problem addressed is that existing prefill acceleration methods are limited to specific model architectures and suffer performance degradation when applied to emerging architectures. Additionally, these methods are often incompatible with continuous batching, making it difficult to integrate them into modern inference engines.

The proposed UniPrefill framework overcomes these limitations by directly accelerating the model's computation at the token level, making it applicable to virtually any model architecture. UniPrefill is implemented as a continuous batching operator and is integrated into the vLLM inference engine, enabling seamless support for prefill-decode co-processing and tensor parallelism.

The results show that UniPrefill achieves significant speedup, with up to 2.1x improvement in Time-To-First-Token, and the acceleration becomes more pronounced as the number of concurrent requests grows. This makes UniPrefill a valuable contribution to the field, enabling more efficient and scalable long-context processing in large language models.

📅 Published on May 7

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.06221
• PDF: https://arxiv.org/pdf/2605.06221
• GitHub: https://github.com/qhfan/UniPrefill ⭐ 22

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LongContextProcessing #PrefillAcceleration #DynamicSparsification #LargeLanguageModels #BlockWiseOptimization

749 views, 21:49

🔥 δ-mem: Efficient Online Memory for Large Language Models

💡 The paper proposes a lightweight memory mechanism called delta-mem to enhance large language models by providing a compact online state of associative memory. The problem addressed is the need for large language models to accumulate and reuse historical information in long-term assistants and agent systems, which is challenging due to the high cost of expanding the context window and ineffective context utilization.

The proposed method, delta-mem, augments a frozen full-attention backbone with a compact online state that compresses past information into a fixed-size state matrix updated by delta-rule learning. This online state is used to generate low-rank corrections to the backbone's attention computation during generation, allowing for efficient online memory.
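
For readers unfamiliar with delta-rule learning, the classic form of the update on a fixed-size state matrix S is S ← S + β (v − S k) kᵀ: the state is nudged so that reading with key k returns value v. The sketch below implements that textbook rule in plain Python. It is an assumption that delta-mem uses this exact form, and the step that turns the state into low-rank attention corrections is not modeled here.

```python
def matvec(S, k):
    """Read from the state: returns S @ k."""
    return [sum(s_ij * k_j for s_ij, k_j in zip(row, k)) for row in S]


def delta_update(S, k, v, beta=1.0):
    """Classic delta rule: S <- S + beta * (v - S k) k^T.
    The state keeps its shape no matter how many pairs are written."""
    pred = matvec(S, k)
    err = [v_i - p_i for v_i, p_i in zip(v, pred)]
    return [
        [s_ij + beta * e_i * k_j for s_ij, k_j in zip(row, k)]
        for row, e_i in zip(S, err)
    ]


d = 8
S = [[0.0] * d for _ in range(d)]        # 8x8 state, matching the size quoted in the summary
k = [1.0] + [0.0] * (d - 1)              # unit-norm key
v1 = [float(i) for i in range(1, d + 1)]
v2 = [float(-i) for i in range(1, d + 1)]

S = delta_update(S, k, v1)               # write (k, v1)
read_after_first = matvec(S, k)
S = delta_update(S, k, v2)               # overwrite the same key with v2
read_after_second = matvec(S, k)
```

With a unit-norm key and `beta=1.0`, a write followed by a read with the same key returns the stored value, and a second write to that key replaces it rather than accumulating, which is what distinguishes the delta rule from a purely additive update.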

The results show that delta-mem improves the average score of the frozen backbone and achieves larger gains on memory-heavy benchmarks, such as MemoryAgentBench and LoCoMo, while preserving general capabilities. Notably, delta-mem achieves these results with only an 8x8 online memory state, demonstrating that effective memory can be realized through a compact online state directly coupled with attention computation, without requiring full fine-tuning, backbone replacement, or explicit context extension. Overall, the paper contributes a novel and efficient approach to enhancing large language models with online memory, which has the potential to improve performance in a range of applications.

📅 Published on May 12

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.12357
• PDF: https://arxiv.org/pdf/2605.12357
• GitHub: https://github.com/declare-lab/delta-Mem ⭐ 46

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#LargeLanguageModels #AssociativeMemoryMechanisms #EfficientOnlineLearning #DeltaRuleLearning #CompactStateRepresentations

590 views, 13:50

🔥 Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

💡 The paper introduces Orthrus, a dual-architecture framework that combines the strengths of autoregressive large language models and diffusion models to achieve fast parallel token generation while maintaining exact inference fidelity. The problem with standard autoregressive decoding is that it is sequential, which is a fundamental bottleneck for high-throughput inference. Diffusion language models try to address this issue with parallel generation, but they suffer from performance degradation, high training costs, and a lack of convergence guarantees.

The Orthrus framework resolves this issue by augmenting a frozen large language model with a lightweight trainable module to create a parallel diffusion view alongside the standard autoregressive view. Both views attend to the same high-fidelity key-value cache: the autoregressive head executes context prefilling to construct accurate key-value representations, and the diffusion head executes parallel generation. The framework employs an exact consensus mechanism between the two views to guarantee lossless inference.

The results show that Orthrus delivers a speedup of up to 7.8 times with only a constant memory cache overhead and minimal parameter additions. This is achieved by sharing key-value caches and using a consensus mechanism, which allows the framework to maintain exact inference fidelity while generating tokens in parallel. Overall, the Orthrus framework provides a simple and efficient solution to the problem of slow sequential decoding in autoregressive large language models, and it has the potential to be seamlessly integrated into existing transformer architectures.

📅 Published on May 12

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.12825
• PDF: https://arxiv.org/pdf/2605.12825

🤖 Models citing this paper:
• https://huggingface.co/chiennv/Orthrus-Qwen3-8B
• https://huggingface.co/chiennv/Orthrus-Qwen3-4B
• https://huggingface.co/chiennv/Orthrus-Qwen3-1.7B

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DiffusionLanguageModels #ParallelTokenGeneration #AutoregressiveDecoding #DualViewDiffusion #LargeLanguageModels

672 views, 11:48

🔥 DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

💡 The paper introduces DataFlow, a framework for unified data preparation and workflow automation in the context of large language models. The problem addressed is the current lack of scalable and reliable data preparation pipelines, which are often dominated by ad-hoc scripts and loosely specified workflows, hindering reproducibility and model performance.

To address this challenge, the authors propose DataFlow, a framework that provides system-level abstractions for modular, reusable, and composable data transformations. It includes a PyTorch-style pipeline construction API and nearly 200 reusable operators, as well as six domain-general pipelines for various tasks such as text, mathematical reasoning, and code.
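
A "PyTorch-style" construction API suggests small operators with a common interface that are composed into a pipeline object and then called on data. The sketch below shows that pattern in generic Python. Every class name here (`Operator`, `Strip`, `MinLength`, `Dedup`, `Pipeline`) is hypothetical and chosen for illustration; none of it is DataFlow's actual API.

```python
class Operator:
    """Hypothetical base class: an operator maps a list of rows to a list of rows."""
    def run(self, rows):
        raise NotImplementedError


class Strip(Operator):
    """Trim surrounding whitespace from every row."""
    def run(self, rows):
        return [r.strip() for r in rows]


class MinLength(Operator):
    """Drop rows shorter than n characters."""
    def __init__(self, n):
        self.n = n

    def run(self, rows):
        return [r for r in rows if len(r) >= self.n]


class Dedup(Operator):
    """Remove exact duplicates while keeping first-seen order."""
    def run(self, rows):
        seen = set()
        out = []
        for r in rows:
            if r not in seen:
                seen.add(r)
                out.append(r)
        return out


class Pipeline:
    """Compose operators in order, similar in spirit to a sequential module."""
    def __init__(self, *ops):
        self.ops = ops

    def __call__(self, rows):
        for op in self.ops:
            rows = op.run(rows)
        return rows


clean = Pipeline(Strip(), MinLength(3), Dedup())
result = clean(["  hello ", "hi", "hello", "world"])
```

Keeping each transformation behind the same `run` interface is what makes operators reusable across pipelines and lets a planner assemble them programmatically.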

The framework also includes DataFlow-Agent, which can automatically translate natural-language specifications into executable pipelines. This is achieved through operator synthesis, pipeline planning, and iterative verification.

The results show that DataFlow consistently improves downstream large language model performance across six representative use cases. The framework outperforms curated human datasets and specialized synthetic baselines, achieving significant gains in execution accuracy and average improvements on code benchmarks.

For example, the math, code, and text pipelines improve execution accuracy in Text-to-SQL by up to 3 percent, yield 7 percent average improvements on code benchmarks, and deliver 1-3 point gains on math benchmarks. Additionally, a unified dataset produced by DataFlow enables base models to surpass counterparts trained on larger datasets.

Overall, the paper demonstrates that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable large language model data preparation, and establishes a system-level foundation for future data-centric AI development.

📅 Published on Dec 18, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2512.16676
• PDF: https://arxiv.org/pdf/2512.16676
• Project Page: https://github.com/OpenDCAI/DataFlow

📊 Datasets citing this paper:
• https://huggingface.co/datasets/OpenDCAI/dataflow-demo-Text2SQL
• https://huggingface.co/datasets/OpenDCAI/dataflow-mm-context_vqa
• https://huggingface.co/datasets/OpenDCAI/dataflow-instruct-10k

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DataCentricAI #LLMDrivenFrameworks #UnifiedDataPreparation #WorkflowAutomation #LargeLanguageModels

888 views, 23:48