AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
FlashRT: Towards Computationally and Memory Efficient Red-Teaming for Prompt Injection and Knowledge Corruption

📝 Summary:
FlashRT significantly enhances the efficiency of optimization-based prompt injection and knowledge corruption attacks for long-context LLMs. It delivers 2x-7x speedup and 2x-4x GPU memory reduction, enabling systematic and scalable security evaluations.

🔹 Publication Date: Published on Apr 30

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.28157
• PDF: https://arxiv.org/pdf/2604.28157
• Github: https://github.com/wang-yanting/FlashRT

==================================

For more data science resources:
https://xn--r1a.website/DataScienceT

#AI #DataScience #MachineLearning #HuggingFace #Research
Step-level Optimization for Efficient Computer-use Agents

📝 Summary:
Computer-use agents waste compute when a large model handles every step. This paper proposes an event-driven cascade that runs a small policy by default and escalates to a stronger model only when lightweight monitors detect high-risk conditions such as stalls or semantic drift.
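
The cascade idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the class name CascadeAgent, the stall_window parameter, and the rule "the same action repeated N times counts as a stall" are all assumptions, and only a stall monitor is shown (no semantic-drift monitor).

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class CascadeAgent:
    """Hypothetical cascade: small policy by default, large policy on a stall."""
    small_policy: Callable[[str], str]
    large_policy: Callable[[str], str]
    stall_window: int = 3  # assumed: identical actions in a row that count as a stall
    history: List[str] = field(default_factory=list)

    def is_stalled(self) -> bool:
        # Lightweight monitor: the last `stall_window` actions are all the same.
        recent = self.history[-self.stall_window:]
        return len(recent) == self.stall_window and len(set(recent)) == 1

    def step(self, observation: str) -> Tuple[str, str]:
        # Escalate only when the monitor fires; otherwise stay on the cheap model.
        if self.is_stalled():
            action, used = self.large_policy(observation), "large"
        else:
            action, used = self.small_policy(observation), "small"
        self.history.append(action)
        return action, used
```

With a small policy that always clicks and a large policy that scrolls, the agent would stay on the small policy for three steps, escalate once, then drop back.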

🔹 Publication Date: Published on Apr 29

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.27151
• PDF: https://arxiv.org/pdf/2604.27151
• Github: https://github.com/yale-nlp/StepWise

==================================

#AI #AgentSystems #ResourceOptimization #EfficientAI #AdaptiveSystems
SignRoundV2: Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

📝 Summary:
SignRoundV2 is a post-training quantization method for LLMs. It achieves competitive, near full-precision accuracy even at extremely low bit widths such as 2 bits, using layer-wise bit allocation and a pre-tuning scale search.
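
To make "scale search" concrete, here is a generic sketch of low-bit round-to-nearest quantization with a grid search over the scale. It is not SignRoundV2's algorithm: the symmetric signed grid, the mean-squared-error objective, and the function names are assumptions chosen for illustration.

```python
from typing import List, Tuple


def quantize_dequantize(weights: List[float], scale: float, bits: int = 2) -> List[float]:
    """Round each weight to a signed integer grid, clamp it, and map it back."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1  # 2 bits: -2..1
    return [max(qmin, min(qmax, round(w / scale))) * scale for w in weights]


def mse(a: List[float], b: List[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_scale(weights: List[float], bits: int = 2, candidates: int = 50) -> Tuple[float, float]:
    """Try evenly spaced scales up to max|w| and keep the one with the lowest error."""
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        raise ValueError("all-zero weights need no scale search")
    best_scale, best_err = max_abs, float("inf")
    for k in range(1, candidates + 1):
        scale = max_abs * k / candidates
        err = mse(weights, quantize_dequantize(weights, scale, bits))
        if err < best_err:
            best_scale, best_err = scale, err
    return best_scale, best_err
```

Because the naive choice scale = max|w| is one of the candidates, the searched scale can never do worse than it on this objective.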

🔹 Publication Date: Published on Dec 4, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2512.04746
• PDF: https://arxiv.org/pdf/2512.04746
• Project Page: https://github.com/intel/auto-round
• Github: https://github.com/intel/auto-round

🔹 Models citing this paper:
https://huggingface.co/Intel/MiroThinker-v1.5-30B-gguf-q2ks-mixed-AutoRound
https://huggingface.co/Intel/DeepSeek-R1-0528-Qwen3-8B-int4-AutoRound

==================================

#LLMs #Quantization #DeepLearning #AI #MachineLearning
Nemotron 3 Nano Omni: Efficient and Open Multimodal Intelligence

📝 Summary:
Nemotron 3 Nano Omni is a new efficient, open multimodal AI model. It natively supports audio, text, images, and video inputs, improving accuracy and efficiency over previous versions. It excels in document understanding and long audio-video comprehension.

🔹 Publication Date: Published on Apr 27

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.24954
• PDF: https://arxiv.org/pdf/2604.24954

🔹 Models citing this paper:
https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-BF16
https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-NVFP4
https://huggingface.co/nvidia/Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8

Spaces citing this paper:
https://huggingface.co/spaces/akhaliq/Nemotron-3-Nano-Omni
https://huggingface.co/spaces/developerjeremylive/Nemotron-3-Nano-Omni-etheroi

==================================

#AI #MultimodalAI #DeepLearning #OpenSourceAI #AIResearch
DeepSeek-OCR: Contexts Optical Compression

📝 Summary:
DeepSeek-OCR compresses long contexts via optical 2D mapping, achieving high OCR precision with significantly fewer vision tokens. It reaches 97% accuracy at 10x compression and outperforms other OCR models while remaining efficient. This innovation holds practical value for document processing and LLM training...
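
As a quick arithmetic check on what "10x compression" means, the sketch below assumes the ratio is the number of text tokens on a page divided by the number of vision tokens used to represent it. That definition and the function name are assumptions for illustration, not taken from the summary above.

```python
def compression_ratio(text_tokens: int, vision_tokens: int) -> float:
    """Assumed definition: text tokens represented per vision token."""
    if vision_tokens <= 0:
        raise ValueError("vision_tokens must be positive")
    return text_tokens / vision_tokens


# Example: a page worth 1000 text tokens encoded with 100 vision tokens.
ratio = compression_ratio(1000, 100)
```

Under this reading, a page of 1000 text tokens encoded with 100 vision tokens sits at the 10x operating point.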

🔹 Publication Date: Published on Oct 21, 2025

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2510.18234
• PDF: https://arxiv.org/pdf/2510.18234
• Github: https://github.com/deepseek-ai/DeepSeek-OCR

🔹 Models citing this paper:
https://huggingface.co/deepseek-ai/DeepSeek-OCR
https://huggingface.co/deepseek-ai/DeepSeek-OCR-2
https://huggingface.co/unsloth/DeepSeek-OCR

Spaces citing this paper:
https://huggingface.co/spaces/merterbak/DeepSeek-OCR-Demo
https://huggingface.co/spaces/davidpcm/openclaw-stock-analyst
https://huggingface.co/spaces/khang119966/DeepSeek-OCR-DEMO

==================================

#OCR #AI #DeepLearning #ContextCompression #LLM
🔥 SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

💡 The paper introduces a new post-training method called SOAR for diffusion models, which addresses the gap between supervised fine-tuning and reinforcement learning. Currently, supervised fine-tuning optimizes the denoiser only on ground-truth states, but once inference deviates from these ideal states, it relies on out-of-distribution generalization rather than learned correction, leading to exposure bias. Reinforcement learning can address this mismatch, but its terminal reward signal is sparse and suffers from credit-assignment difficulty.

SOAR proposes a bias-correction post-training method that fills this gap by providing dense, reward-free supervision through self-correction mechanisms. The method starts from a real sample, performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. This approach is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem.
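
One way to read the four steps above is the toy sketch below, written for a single scalar sample. It is not the authors' implementation: the rectified-flow interpolation x_t = (1 - t) * x0 + t * noise, the single-step rollout to a clean estimate, the re-noising rule, and the names soar_style_pair and model are all assumptions made for illustration.

```python
from typing import Callable, Tuple


def soar_style_pair(
    x0: float,
    t: float,
    s: float,
    noise: float,
    fresh_noise: float,
    model: Callable[[float, float], float],
) -> Tuple[float, float]:
    """Build one (input state, velocity target) pair for a self-correction loss.

    Assumed convention: velocity v = noise - x0, so x0 = x_t - t * v.
    """
    if not 0.0 < s <= 1.0:
        raise ValueError("s must be in (0, 1]")
    # 1. Start from a real sample and noise it to time t.
    x_t = (1 - t) * x0 + t * noise
    # 2. Single rollout with the current model. In an autodiff framework this
    #    value would be detached (stop-gradient); plain floats carry no gradient.
    x0_hat = x_t - t * model(x_t, t)
    # 3. Re-noise the (possibly off-trajectory) estimate to time s.
    x_s = (1 - s) * x0_hat + s * fresh_noise
    # 4. Target velocity that steers x_s back to the ORIGINAL clean sample.
    target_v = (x_s - x0) / s
    return x_s, target_v
```

In this sketch a perfect model reproduces the ordinary supervised target (fresh_noise - x0), while a biased model yields a state off the ideal trajectory whose target still points back to x0. That is the sense in which the supervision is dense and needs no reward.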

The results show that SOAR improves the performance of diffusion models on various tasks, including image and text generation. On SD3.5-Medium, SOAR improves the GenEval score from 0.70 to 0.78 and the OCR score from 0.64 to 0.67 over supervised fine-tuning. Additionally, SOAR surpasses Flow-GRPO in final metric value on both aesthetic and text-image alignment tasks, despite having no access to a reward model. The paper concludes that SOAR can directly replace supervised fine-tuning as a stronger first post-training stage after pretraining, while remaining fully compatible with subsequent reinforcement learning alignment.


📅 Published on Apr 14

🔗 Links:
• arXiv: https://arxiv.org/abs/2604.12617
• PDF: https://arxiv.org/pdf/2604.12617
• Project Page: https://hy-soar.github.io/
• GitHub: https://github.com/Tencent-Hunyuan/HY-SOAR 350

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DiffusionModels #SelfCorrectionTechniques #OptimalAlignmentMethods #RefinementInAI #PostTrainingMethods
🔥 TradingAgents: Multi-Agents LLM Financial Trading Framework

💡 The paper introduces TradingAgents, a multi-agent framework that utilizes large language models for stock trading, simulating the collaborative dynamics of real-world trading firms. The framework consists of various agents, including fundamental analysts, sentiment analysts, technical analysts, and traders with different risk profiles, all powered by large language models. These agents work together to assess market conditions, manage risk, and make informed trading decisions. The framework also includes researcher agents that evaluate market conditions and a risk management team that monitors exposure.

The authors propose this framework as a solution to the limitations of existing single-agent systems and multi-agent frameworks that gather data independently. By simulating a dynamic and collaborative trading environment, TradingAgents aims to improve trading performance metrics such as cumulative returns and Sharpe ratio.

The results of the experiments show that the TradingAgents framework outperforms baseline models, with significant improvements in cumulative returns, Sharpe ratio, and maximum drawdown. The framework is made available to the public, demonstrating the potential of multi-agent large language model frameworks in financial trading. Overall, the paper contributes to the development of more sophisticated and collaborative trading systems, inspired by the dynamics of real-world trading firms.
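
For readers unfamiliar with the three metrics named above, here is a small sketch using their standard textbook definitions on a list of per-period simple returns. The function names, the default of 252 trading periods per year, and the zero risk-free rate are assumptions, and the paper's exact evaluation setup may differ.

```python
import math
from statistics import mean, stdev
from typing import List


def cumulative_return(returns: List[float]) -> float:
    """Compound per-period returns into one total return."""
    growth = 1.0
    for r in returns:
        growth *= 1 + r
    return growth - 1


def sharpe_ratio(returns: List[float], risk_free: float = 0.0,
                 periods_per_year: int = 252) -> float:
    """Annualized mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    return mean(excess) / stdev(excess) * math.sqrt(periods_per_year)


def max_drawdown(returns: List[float]) -> float:
    """Largest peak-to-trough loss of the equity curve, as a fraction of the peak."""
    equity, peak, worst = 1.0, 1.0, 0.0
    for r in returns:
        equity *= 1 + r
        peak = max(peak, equity)
        worst = max(worst, (peak - equity) / peak)
    return worst
```

Higher cumulative return and Sharpe ratio are better, while a smaller maximum drawdown is better, so an "improvement" in drawdown means a reduction.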


📅 Published on Dec 28, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2412.20138
• PDF: https://arxiv.org/pdf/2412.20138
• GitHub: https://github.com/tauricresearch/tradingagents 66.0k

🚀 Spaces citing this paper:
https://huggingface.co/spaces/shanghengdu/LLM-Agent-Optimization-PaperList
https://huggingface.co/spaces/tahp0604/ai-stock-watchlist
https://huggingface.co/spaces/Ervin2077/qiu

━━━━━━━━━━━━━━━━━━━━━━━━

#MultiAgentSystems #LargeLanguageModels #FinancialTrading #ArtificialIntelligenceInFinance #AgentBasedModeling
🔥 VibeVoice Technical Report

💡 The VibeVoice Technical Report presents a model for synthesizing long-form, multi-speaker speech efficiently and at high quality. The authors use next-token diffusion, a unified approach for modeling continuous data that generates latent vectors via diffusion.

The authors also introduce a continuous speech tokenizer that compresses data 80 times more than the popular Encodec model while maintaining comparable performance. The tokenizer preserves audio fidelity and makes long sequences efficient to process.

VibeVoice can synthesize up to 90 minutes of speech with up to 4 speakers. According to the report, it captures an authentic conversational tone and surpasses both open-source and proprietary dialogue models.


📅 Published on Aug 26, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2508.19205
• PDF: https://arxiv.org/pdf/2508.19205
• Project Page: https://microsoft.github.io/VibeVoice/
• GitHub: https://github.com/microsoft/VibeVoice 46.4k

🤖 Models citing this paper:
https://huggingface.co/microsoft/VibeVoice-1.5B
https://huggingface.co/microsoft/VibeVoice-Realtime-0.5B
https://huggingface.co/aoi-ot/VibeVoice-Large

🚀 Spaces citing this paper:
https://huggingface.co/spaces/ChaitanyaChandra/VibeVoice
https://huggingface.co/spaces/lths/VibeVoice-Demo
https://huggingface.co/spaces/vibingvoice/vibe-voice-custom-voices

━━━━━━━━━━━━━━━━━━━━━━━━

#SpeechSynthesis #MultiSpeakerModeling #DiffusionBasedModeling #ContinuousSpeechTokenization #LatentVectorGeneration
🔥 Kronos: A Foundation Model for the Language of Financial Markets

💡 The paper introduces Kronos, a pre-training framework for financial K-line data that outperforms existing models in forecasting and synthetic data generation. The problem addressed is that current time series foundation models often underperform non-pre-trained architectures when applied to financial candlestick data and overlook important tasks such as volatility prediction and synthetic data generation.

To solve this, the authors propose a specialized tokenizer that converts continuous market information into token sequences, preserving price dynamics and trade activity patterns. Kronos is pre-trained using an autoregressive objective on a large dataset of over 12 billion K-line records from 45 global exchanges, enabling it to learn nuanced temporal and cross-asset representations.

The results show that Kronos excels in a zero-shot setting across various financial tasks, achieving a 93 percent improvement in price series forecasting over the leading time series foundation model and an 87 percent improvement over the best non-pre-trained baseline. Additionally, Kronos achieves a 9 percent lower mean absolute error in volatility forecasting and a 22 percent improvement in generative fidelity for synthetic K-line sequences. The pre-trained model is publicly available, establishing Kronos as a robust and versatile foundation model for end-to-end financial time series analysis.
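
To illustrate what turning continuous market data into tokens can look like, the sketch below bins clipped log returns of closing prices into a small vocabulary. This is a deliberately simple stand-in, not the Kronos tokenizer: the uniform bins, the clip value, and the function name tokenize_closes are all assumptions.

```python
import math
from typing import List


def tokenize_closes(closes: List[float], num_bins: int = 8, clip: float = 0.05) -> List[int]:
    """Map each close-to-close log return to a bin index in [0, num_bins - 1]."""
    tokens = []
    for prev, cur in zip(closes, closes[1:]):
        r = math.log(cur / prev)
        r = max(-clip, min(clip, r))                   # clip extreme moves
        idx = int((r + clip) / (2 * clip) * num_bins)  # uniform binning
        tokens.append(min(idx, num_bins - 1))          # r == clip lands in the top bin
    return tokens
```

A sequence of such token IDs is the kind of input an autoregressive model can be pre-trained on with a next-token objective.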


📅 Published on Aug 2, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2508.02739
• PDF: https://arxiv.org/pdf/2508.02739
• GitHub: https://github.com/shiyu-coder/Kronos 22.7k

🤖 Models citing this paper:
https://huggingface.co/NeoQuasar/Kronos-base
https://huggingface.co/NeoQuasar/Kronos-Tokenizer-base
https://huggingface.co/NeoQuasar/Kronos-mini

🚀 Spaces citing this paper:
https://huggingface.co/spaces/yingfeng64/kronos-api
https://huggingface.co/spaces/almascp/kronos-eurusd-dashboard
https://huggingface.co/spaces/superyan/kronos-jp

━━━━━━━━━━━━━━━━━━━━━━━━

#FinancialLanguageModels #KLineDataAnalysis #TimeSeriesForecasting #VolatilityPrediction #FinancialMarketModeling
🔥 MinerU2.5: A Decoupled Vision-Language Model for Efficient High-Resolution Document Parsing

💡 The paper introduces MinerU2.5, a 1.2 billion parameter vision-language model designed for efficient high-resolution document parsing. The model achieves state-of-the-art recognition accuracy while maintaining computational efficiency through a two-stage parsing strategy.

In the first stage, the model performs layout analysis on downsampled images to identify structural elements, reducing computational overhead. In the second stage, it performs targeted content recognition on native-resolution crops extracted from the original image, preserving fine-grained details in dense text, complex formulas, and tables. To support this strategy, the authors developed a comprehensive data engine that generates diverse, large-scale training corpora for both pretraining and fine-tuning.

The results demonstrate that MinerU2.5 achieves state-of-the-art performance on multiple benchmarks, surpassing both general-purpose and domain-specific models across various recognition tasks, while maintaining significantly lower computational overhead. Overall, the paper contributes a novel approach to document parsing that balances accuracy and efficiency, making it suitable for a wide range of applications.
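
The link between the two stages is a coordinate mapping: a box found on the downsampled image has to be scaled back to the original before cropping. The sketch below shows only that step, with a hypothetical helper name and box format, and is not code from MinerU.

```python
import math
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def to_native_box(box: Box, thumb_size: Tuple[int, int], native_size: Tuple[int, int]) -> Box:
    """Scale a layout box from the downsampled image to native resolution.

    Sizes are (width, height). The box is rounded outward so the crop never
    cuts into the detected region, then clamped to the image bounds.
    """
    sx = native_size[0] / thumb_size[0]
    sy = native_size[1] / thumb_size[1]
    left, top, right, bottom = box
    return (
        math.floor(left * sx),
        math.floor(top * sy),
        min(native_size[0], math.ceil(right * sx)),
        min(native_size[1], math.ceil(bottom * sy)),
    )
```

Layout analysis then only ever sees the small image, while recognition receives full-detail crops of just the regions that matter.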


📅 Published on Sep 26, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2509.22186
• PDF: https://arxiv.org/pdf/2509.22186
• Project Page: https://opendatalab.github.io/MinerU/
• GitHub: https://github.com/opendatalab/MinerU 61.9k

🤖 Models citing this paper:
https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
https://huggingface.co/opendatalab/MinerU-Diffusion-V1-0320-2.5B
https://huggingface.co/freakynit/MinerU2.5-2509-1.2B

🚀 Spaces citing this paper:
https://huggingface.co/spaces/xiaoye-winters/MinerU-API
https://huggingface.co/spaces/opendatalab/MinerU-Diffusion-V1-0320-2.5B
https://huggingface.co/spaces/Instantnewdesign/document_extract

━━━━━━━━━━━━━━━━━━━━━━━━

#DocumentParsing #VisionLanguageModel #HighResolutionImageProcessing #LayoutAnalysis #ContentRecognition