AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
Calibri: Enhancing Diffusion Transformers via Parameter-Efficient Calibration

📝 Summary:
Calibri enhances Diffusion Transformers by adding a single learned scaling parameter to improve generative quality. This parameter-efficient method, optimizing only ~100 parameters, reduces inference steps across various text-to-image models while maintaining high-quality outputs.
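
As a rough illustration of what a single learned scaling parameter on a transformer block can look like, here is a minimal sketch. Where the scale is applied (on the residual branch output), the toy `toy_branch` function, and all names are assumptions made for illustration, not the paper's implementation.

```python
from typing import Callable, List

Vector = List[float]

def gated_block(x: Vector, block_fn: Callable[[Vector], Vector], scale: float) -> Vector:
    """Residual block whose branch output is multiplied by one scalar."""
    branch = block_fn(x)
    return [xi + scale * bi for xi, bi in zip(x, branch)]

def toy_branch(x: Vector) -> Vector:
    # Hypothetical stand-in for an attention or MLP sub-layer.
    return [2.0 * xi for xi in x]

x = [1.0, -0.5]
baseline = gated_block(x, toy_branch, scale=1.0)    # uncalibrated block
calibrated = gated_block(x, toy_branch, scale=0.5)  # same block, one tuned scalar
```

If each gated component carried one such scalar, a model with on the order of a hundred components would expose on the order of a hundred tunable values, which would be in line with the ~100 parameters quoted above.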

🔹 Publication Date: Published on Mar 25

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.24800
• PDF: https://arxiv.org/pdf/2603.24800
• Project Page: https://v-gen-ai.github.io/Calibri-page/
• Github: https://github.com/v-gen-ai/Calibri

🔹 Models citing this paper:
https://huggingface.co/v-gen-ai/flux-calibri-gates
https://huggingface.co/v-gen-ai/qwen-calibri

==================================

For more data science resources:
https://xn--r1a.website/DataScienceT

#DiffusionModels #GenerativeAI #AIResearch #MachineLearning #DeepLearning
Diffutron: A Masked Diffusion Language Model for Turkish Language

📝 Summary:
Diffutron introduces a compact masked diffusion language model for Turkish. It uses resource-efficient LoRA-based pre-training and progressive instruction tuning. The model achieves competitive performance for non-autoregressive Turkish text generation despite its small size.
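
For readers new to masked diffusion language models, the sketch below shows the usual forward (corruption) step of that model family: each token is independently replaced by a mask symbol with probability `t`, and a model is then trained to recover the masked tokens. Whether Diffutron uses exactly this schedule is an assumption; the `MASK` string and the example sentence are illustrative only.

```python
import random
from typing import List, Tuple

MASK = "[MASK]"  # hypothetical mask symbol

def mask_tokens(tokens: List[str], t: float, rng: random.Random) -> Tuple[List[str], List[int]]:
    """Independently replace each token with MASK with probability t."""
    noisy: List[str] = []
    masked_positions: List[int] = []
    for i, tok in enumerate(tokens):
        if rng.random() < t:
            noisy.append(MASK)
            masked_positions.append(i)
        else:
            noisy.append(tok)
    return noisy, masked_positions

rng = random.Random(0)
tokens = ["Bugün", "hava", "çok", "güzel"]
noisy, positions = mask_tokens(tokens, t=0.5, rng=rng)
```

Generation in such models typically runs this in reverse, starting from a fully masked sequence and filling in several positions per step, which is what makes them non-autoregressive.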

🔹 Publication Date: Published on Mar 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.20466
• PDF: https://arxiv.org/pdf/2603.20466

🔹 Models citing this paper:
https://huggingface.co/diffutron/DiffutronLM-0.3B-Instruct
https://huggingface.co/diffutron/DiffutronLM-0.3B-Base
https://huggingface.co/diffutron/DiffutronLM-0.3B-1st-Stage

Datasets citing this paper:
https://huggingface.co/datasets/diffutron/DiffutronLM-Pretraining-Corpus

#LanguageModels #TurkishNLP #DiffusionModels #NLP #AI
On-the-fly Repulsion in the Contextual Space for Rich Diversity in Diffusion Transformers

📝 Summary:
Diffusion transformers often lack visual diversity. This paper introduces on-the-fly repulsion in the contextual space to enhance diversity. It intervenes in multimodal attention during the forward pass, yielding rich outcomes without losing quality or efficiency.

🔹 Publication Date: Published on Mar 30

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.28762
• PDF: https://arxiv.org/pdf/2603.28762
• Project Page: https://contextual-repulsion.github.io/

#DiffusionModels #DeepLearning #GenerativeAI #ComputerVision #AIResearch
PoseDreamer: Scalable and Photorealistic Human Data Generation Pipeline with Diffusion Models

📝 Summary:
PoseDreamer uses diffusion models to generate large-scale, photorealistic synthetic 3D human mesh datasets with improved image quality. Models trained on this data achieve comparable or superior performance to those using real or traditional synthetic datasets, offering a scalable solution.

🔹 Publication Date: Published on Mar 30

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2603.28763
• PDF: https://arxiv.org/pdf/2603.28763
• Project Page: https://prosperolo.github.io/posedreamer

#DiffusionModels #SyntheticData #3DGeneration #ComputerVision #AIResearch
VOID: Video Object and Interaction Deletion

📝 Summary:
VOID is a video object removal framework designed for complex scenarios involving significant object interactions. It uses vision-language and video diffusion models, leveraging causal reasoning to generate physically plausible counterfactual scenes. VOID better preserves consistent scene dynamic...

🔹 Publication Date: Published on Apr 2

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.02296
• PDF: https://arxiv.org/pdf/2604.02296
• Project Page: https://void-model.github.io/
• Github: https://github.com/Netflix/void-model

🔹 Models citing this paper:
https://huggingface.co/netflix/void-model

Spaces citing this paper:
https://huggingface.co/spaces/sam-motamed/VOID

#VideoEditing #DiffusionModels #ComputerVision #GenerativeAI #DeepLearning
RefineAnything: Multimodal Region-Specific Refinement for Perfect Local Details

📝 Summary:
RefineAnything is a multimodal diffusion model for region-specific image refinement. It fixes local detail collapse while strictly preserving backgrounds using a Focus-and-Refine strategy and boundary-aware loss. This provides a practical solution for high-precision local editing.

🔹 Publication Date: Published on Apr 8

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.06870
• PDF: https://arxiv.org/pdf/2604.06870
• Project Page: https://limuloo.github.io/RefineAnything/
• Github: https://github.com/limuloo/RefineAnything

#DiffusionModels #ImageEditing #ComputerVision #DeepLearning #GenerativeAI
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

📝 Summary:
Matrix-Game 3.0 is a memory-augmented diffusion model achieving real-time 720p interactive video generation with long-term temporal consistency. It uses an advanced data engine, a self-correction training framework with memory, and efficient inference strategies. This enables practical, industria...

🔹 Publication Date: Published on Apr 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08995
• PDF: https://arxiv.org/pdf/2604.08995
• Project Page: https://matrix-game-v3.github.io/

#DiffusionModels #VideoGeneration #RealTimeAI #GenerativeAI #MachineLearning
CT-1: Vision-Language-Camera Models Transfer Spatial Reasoning Knowledge to Camera-Controllable Video Generation

📝 Summary:
CT-1 is a Vision-Language-Camera model that improves camera-controllable video generation. It uses a Diffusion Transformer and Wavelet Regularization Loss to accurately estimate camera trajectories, enabling precise video synthesis. This achieves 25.7% better accuracy than prior methods.

🔹 Publication Date: Published on Apr 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.09201
• PDF: https://arxiv.org/pdf/2604.09201
• Project Page: https://gulucaptain.github.io/Camera-Transformer-1/
• Github: https://github.com/gulucaptain/Camera-Transformer-1

#AI #VideoGeneration #ComputerVision #DiffusionModels #VisionLanguageModels
MixFlow: Mixed Source Distributions Improve Rectified Flows

📝 Summary:
Rectified flows and diffusion models are improved through κ-FC formulation that conditions the source distribution and MixFlow training strategy that reduces generative path curvatures and enhances sa...

🔹 Publication Date: Published on Apr 10

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.09181
• PDF: https://arxiv.org/pdf/2604.09181
• Github: https://github.com/NazirNayal8/MixFlow

#RectifiedFlows #DiffusionModels #GenerativeAI #MachineLearning #AIResearch
Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

📝 Summary:
Uni-ViGU introduces a unified framework for video generation and understanding, uniquely building upon a video generator as its foundation. It uses unified flow matching and a bidirectional training mechanism to achieve competitive performance in both generation and understanding tasks.

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08121
• PDF: https://arxiv.org/pdf/2604.08121
• Project Page: https://fr0zencrane.github.io/uni-vigu-page/

#VideoGeneration #VideoUnderstanding #DiffusionModels #AIResearch #DeepLearning
Domain-Specific Latent Representations Improve the Fidelity of Diffusion-Based Medical Image Super-Resolution

📝 Summary:
Domain-specific autoencoders significantly enhance medical image super-resolution. Replacing a generic VAE with a domain-specific one improves fidelity, indicating that the choice of autoencoder, rather than the diffusion architecture, is the key factor. Autoencoder performance predicts overall SR quality.

🔹 Publication Date: Published on Apr 14

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.12152
• PDF: https://arxiv.org/pdf/2604.12152
• Github: https://github.com/sebasmos/latent-sr

#MedicalImaging #SuperResolution #DiffusionModels #DeepLearning #Autoencoders
Repurposing 3D Generative Model for Autoregressive Layout Generation

📝 Summary:
LaviGen is a 3D layout generation framework that repurposes 3D generative models. It uses an adapted 3D diffusion model for autoregressive generation, explicitly modeling geometric relations and physical constraints. This achieves superior, more plausible 3D layouts 65% faster than previous methods.

🔹 Publication Date: Published on Apr 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.16299
• PDF: https://arxiv.org/pdf/2604.16299
• Project Page: https://fenghora.github.io/LaviGen-Page/
• Github: https://github.com/fenghora/LaviGen

#3DGeneration #DiffusionModels #GenerativeAI #ComputerGraphics #DeepLearning
Hierarchical Codec Diffusion for Video-to-Speech Generation

📝 Summary:
HiCoDiT generates speech from videos by leveraging the hierarchical structure of discrete speech tokens, achieving better audio-visual alignment through coarse-to-fine conditioning with dual-scale nor...

🔹 Publication Date: Published on Apr 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.15923
• PDF: https://arxiv.org/pdf/2604.15923

#VideoToSpeech #DiffusionModels #GenerativeAI #SpeechSynthesis #DeepLearning
UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

📝 Summary:
UDM-GRPO integrates Uniform Discrete Diffusion Models with reinforcement learning, solving training instability issues. It optimizes using final samples as actions and reconstructed trajectories. This achieves state-of-the-art performance in text-to-image generation and OCR tasks.
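
The "group relative" part of GRPO refers to scoring each sample against the other samples generated for the same prompt. The sketch below shows only that normalization step. Using the population standard deviation and the `eps` constant are assumptions, and the paper's handling of final samples and reconstructed trajectories is not modeled here.

```python
import statistics
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (reward - group mean) / group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Rewards for four samples drawn for the same prompt (made-up values).
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Samples above the group mean get a positive advantage and samples below it a negative one, so no learned value function is needed as a baseline.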

🔹 Publication Date: Published on Apr 20

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.18518
• PDF: https://arxiv.org/pdf/2604.18518
• Project Page: https://yovecent.github.io/UDM-GRPO.github.io/
• Github: https://github.com/Yovecent/UDM-GRPO

🔹 Models citing this paper:
https://huggingface.co/Yovecents/URSA-1.7B-IBQ512-UDMGRPO-GenEval
https://huggingface.co/Yovecents/URSA-1.7B-IBQ512-UDMGRPO-PickScore

#DiffusionModels #ReinforcementLearning #GenerativeAI #TextToImage #DeepLearning
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

📝 Summary:
dWorldEval proposes a scalable robotics policy evaluation method using a discrete diffusion world model. It unifies diverse modalities into a token space, employing a transformer and progress token for success detection. This approach significantly outperforms prior methods, enabling large-scale ...

🔹 Publication Date: Published on Apr 24

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.22152
• PDF: https://arxiv.org/pdf/2604.22152

#Robotics #DiffusionModels #WorldModels #AI #MachineLearning
DiffNR: Diffusion-Enhanced Neural Representation Optimization for Sparse-View 3D Tomographic Reconstruction

📝 Summary:
DiffNR enhances sparse-view CT reconstruction with neural representations by employing SliceFixer, a single-step diffusion model. It corrects artifacts via pseudo-reference volumes, offering 3D supervision for better accuracy and efficient optimization, with a 3.99 dB PSNR gain.
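
To put the 3.99 dB figure in perspective, PSNR is defined as 10·log10(MAX² / MSE), so a gain in dB maps directly to a reduction factor in mean squared error. The short sketch below works that out. The function names are illustrative, and it assumes both reconstructions are measured against the same peak value.

```python
import math

def psnr(mse: float, max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB."""
    return 10.0 * math.log10(max_value ** 2 / mse)

def mse_ratio_for_gain(gain_db: float) -> float:
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)

ratio = mse_ratio_for_gain(3.99)  # roughly 2.5
```

In other words, a 3.99 dB PSNR gain corresponds to a mean squared error about 2.5 times lower.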

🔹 Publication Date: Published on Apr 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.21518
• PDF: https://arxiv.org/pdf/2604.21518
• Project Page: https://ooonesevennn.github.io/DiffNR/
• Github: https://github.com/ooonesevennn/DiffNR

#3DReconstruction #DiffusionModels #NeuralNetworks #CTReconstruction #DeepLearning
🔥 SOAR: Self-Correction for Optimal Alignment and Refinement in Diffusion Models

💡 The paper introduces a new post-training method called SOAR for diffusion models, which addresses the gap between supervised fine-tuning and reinforcement learning. Currently, supervised fine-tuning optimizes the denoiser only on ground-truth states, but once inference deviates from these ideal states, it relies on out-of-distribution generalization rather than learned correction, leading to exposure bias. Reinforcement learning can address this mismatch, but its terminal reward signal is sparse and suffers from credit-assignment difficulty.

SOAR proposes a bias-correction post-training method that fills this gap by providing dense, reward-free supervision through self-correction mechanisms. The method starts from a real sample, performs a single stop-gradient rollout with the current model, re-noises the resulting off-trajectory state, and supervises the model to steer back toward the original clean target. This approach is on-policy, reward-free, and provides dense per-timestep supervision with no credit-assignment problem.
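
The four steps described above can be laid out as a data-flow sketch. Everything concrete here is an assumption made for illustration: scalar samples instead of images, a linear noising rule, a denoiser that predicts the clean sample directly, a one-call rollout, and a squared-error loss. It only shows which value is held fixed and which call would be trained.

```python
import random
from typing import Callable

# (noisy state, noise level) -> estimate of the clean sample
Denoiser = Callable[[float, float], float]

def add_noise(x: float, level: float, rng: random.Random) -> float:
    """Blend x with Gaussian noise: level 0 keeps x, level 1 is pure noise."""
    return (1.0 - level) * x + level * rng.gauss(0.0, 1.0)

def self_correction_loss(x0: float, denoise: Denoiser, t: float, s: float,
                         rng: random.Random) -> float:
    x_t = add_noise(x0, t, rng)       # start from a real sample, noised to level t
    x_hat = denoise(x_t, t)           # rollout with the current model; in training
                                      # this value would be detached (stop-gradient)
    x_off = add_noise(x_hat, s, rng)  # re-noise the off-trajectory state
    pred = denoise(x_off, s)          # the only call that would receive gradients
    return (pred - x0) ** 2           # supervise toward the original clean target

rng = random.Random(0)
imperfect = lambda x, level: 0.9 * x  # hypothetical stand-in for the current model
loss = self_correction_loss(x0=2.0, denoise=imperfect, t=0.7, s=0.3, rng=rng)
```

Because the target is always the original clean sample, every noise level yields its own supervised loss term, which is where the dense, reward-free signal comes from.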

The results show that SOAR improves diffusion models on text-to-image benchmarks. With SD3.5-Medium as the base model, SOAR improves the GenEval score from 0.70 to 0.78 and the OCR score from 0.64 to 0.67 over supervised fine-tuning. SOAR also surpasses Flow-GRPO in final metric value on both aesthetic and text-image alignment tasks, despite having no access to a reward model. The paper concludes that SOAR can directly replace supervised fine-tuning as a stronger first post-training stage after pretraining, while remaining fully compatible with subsequent reinforcement learning alignment.


📅 Published on Apr 14

🔗 Links:
• arXiv: https://arxiv.org/abs/2604.12617
• PDF: https://arxiv.org/pdf/2604.12617
• Project Page: https://hy-soar.github.io/
• GitHub: https://github.com/Tencent-Hunyuan/HY-SOAR

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DiffusionModels #SelfCorrectionTechniques #OptimalAlignmentMethods #RefinementInAI #PostTrainingMethods
🔥 D-OPSD: On-Policy Self-Distillation for Continuously Tuning Step-Distilled Diffusion Models

💡 The paper introduces D-OPSD, a new training approach for diffusion models that enables efficient supervised fine-tuning while preserving few-step inference capabilities. High-performance image generation is shifting from inefficient multi-step models to efficient few-step models, but fine-tuning these models with traditional techniques compromises their inherent few-step inference capability.

To address this issue, the authors propose D-OPSD, which leverages on-policy self-distillation with text and multimodal features. The method works by making the model act as both the teacher and the student, where the student is conditioned only on the text feature, and the teacher is conditioned on the multimodal feature of both the text prompt and the target image. The training process minimizes the difference between the predicted distributions over the student's own roll-outs, allowing the model to learn new concepts and styles without sacrificing its original few-step capacity.

The key contribution of D-OPSD is that it enables on-policy learning during supervised fine-tuning, which allows the model to learn from its own trajectory and under its own supervision. This approach enables the model to inherit the in-context capabilities of its encoder, making it possible to fine-tune the model continuously without compromising its few-step inference capability. The results show that D-OPSD enables efficient supervised fine-tuning for diffusion models, making it a promising approach for high-performance image generation models.


📅 Published on May 6

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.05204
• PDF: https://arxiv.org/pdf/2605.05204
• Project Page: https://vvvvvjdy.github.io/d-opsd/
• GitHub: https://github.com/vvvvvjdy/D-OPSD

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#DiffusionModels #SelfDistillation #FewStepGeneration #ImageGeneration #MultimodalLearning
🔥 MARBLE: Multi-Aspect Reward Balance for Diffusion RL

💡 The paper introduces MARBLE, a novel gradient-space optimization framework for multi-reward reinforcement learning fine-tuning of diffusion models. The problem addressed is that existing methods for handling multiple rewards either train separate models for each reward or use a weighted-sum reward aggregation, which can lead to poor performance due to sample-level mismatch. This mismatch occurs because most rollouts are highly informative for certain reward dimensions but irrelevant for others, causing the weighted summation to dilute their supervision.

To address this issue, MARBLE maintains independent advantage estimators for each reward and computes per-reward policy gradients. These gradients are then harmonized into a single update direction without manual reward weighting, by solving a quadratic programming problem. This approach allows for a unified model that can be jointly trained on all rewards, eliminating the need for heavy manual tuning and sequential training.

The authors also propose an amortized formulation that reduces the computational cost of MARBLE, making it more efficient. Additionally, they use exponential moving average smoothing on the balancing coefficients to stabilize updates against transient fluctuations.
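
To make the idea of harmonizing per-reward gradients concrete, here is a minimal two-reward sketch. The summary does not state the exact quadratic program, so this uses the classic minimum-norm combination (which has a closed form for two gradients) as a stand-in, followed by exponential moving average smoothing of the coefficient. All names, the `beta` value, and the toy gradients are assumptions.

```python
from typing import List

Vector = List[float]

def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))

def min_norm_weight(g1: Vector, g2: Vector) -> float:
    """Weight a in [0, 1] minimizing ||a*g1 + (1-a)*g2||^2."""
    diff = [x - y for x, y in zip(g1, g2)]
    denom = dot(diff, diff)
    if denom == 0.0:
        return 0.5  # identical gradients: any weight gives the same direction
    a = dot([y - x for x, y in zip(g1, g2)], g2) / denom
    return min(1.0, max(0.0, a))

def ema(previous: float, new: float, beta: float = 0.9) -> float:
    """Smooth a balancing coefficient across training steps."""
    return beta * previous + (1.0 - beta) * new

def combine(g1: Vector, g2: Vector, a: float) -> Vector:
    return [a * x + (1.0 - a) * y for x, y in zip(g1, g2)]

g_reward_a = [1.0, 0.0]    # toy policy gradient for reward A
g_reward_b = [-0.5, 1.0]   # toy policy gradient for reward B (conflicts with A)
weight = ema(previous=0.5, new=min_norm_weight(g_reward_a, g_reward_b))
update = combine(g_reward_a, g_reward_b, weight)
```

In this toy case the two gradients have a negative inner product, yet the combined update has a positive inner product with both, so a step along it does not work against either reward to first order.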

The results show that MARBLE improves all five reward dimensions simultaneously on SD3.5 Medium, outperforming the baseline method. Specifically, MARBLE turns the worst-aligned reward's gradient cosine from negative to consistently positive, meaning the shared update no longer works against that reward. MARBLE also runs at nearly the same training speed as the baseline method, with only a 3% slowdown. Overall, MARBLE provides a more effective and efficient approach to multi-reward reinforcement learning fine-tuning of diffusion models.


📅 Published on May 7

🔗 Links:
• arXiv: https://arxiv.org/abs/2605.06507
• PDF: https://arxiv.org/pdf/2605.06507
• Project Page: https://aim-uofa.github.io/MARBLE/
• GitHub: https://github.com/aim-uofa/MARBLE

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultiRewardReinforcementLearning #DiffusionModels #GradientSpaceOptimization #MultiAspectRewardBalance #ReinforcementLearningFineTuning
🔥 i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models

💡 The paper presents a comprehensive study of text-to-image diffusion models, aiming to identify key design choices and training insights that lead to strong model performance. The problem addressed is the lack of fully open models that match the performance of state-of-the-art models, which hinders further research in the field. To tackle this, the authors conducted over 300 controlled experiments, totaling 700K TPU v6e hours, to investigate modeling and data design choices in text-to-image diffusion training and inference.

The method is a systematic investigation of design decisions, such as dataset mixing and text encoder adapters, to identify simple yet effective approaches to training strong models. The authors report several empirical findings, including that equal weighting works well for mixing curated datasets and that larger text encoder adapters are beneficial.
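
One plausible reading of equal weighting for dataset mixing is that each training example is drawn by first picking a source dataset uniformly at random, regardless of how large each dataset is. The sketch below illustrates that reading only. The dataset names, sizes, and the sampling scheme are assumptions, not the paper's data pipeline.

```python
import random
from typing import Dict, List

def sample_equal_mix(datasets: Dict[str, List[str]], n: int, rng: random.Random) -> List[str]:
    """Draw n examples, choosing the source dataset uniformly each time."""
    names = sorted(datasets)
    batch: List[str] = []
    for _ in range(n):
        name = rng.choice(names)                 # equal weight per dataset
        batch.append(rng.choice(datasets[name])) # uniform within the dataset
    return batch

# Two hypothetical curated datasets of very different sizes.
datasets = {
    "small": ["s0", "s1"],
    "large": [f"l{i}" for i in range(1000)],
}
batch = sample_equal_mix(datasets, n=10000, rng=random.Random(0))
small_fraction = sum(item.startswith("s") for item in batch) / len(batch)
```

Under this scheme the small dataset supplies about half of the batch even though it holds far fewer examples, in contrast to size-proportional mixing where it would be almost invisible.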

The results of the study led to the development of i1, a 3B-parameter text-to-image diffusion model trained using only publicly available datasets. The i1 model is competitive with leading models on five representative benchmarks and outperforms the best existing fully open model by 29.5 absolute percentage points on average. The authors provide the i1 checkpoints, training and inference code, and the data processing pipeline, making it a fully open model that can serve as a foundation for future research in text-to-image diffusion models.

Overall, the paper contributes to the field by providing a practical foundation for open research in text-to-image diffusion models, highlighting the importance of transparency and reproducibility in AI research. The release of the i1 model and its associated code and data processing pipeline enables the research community to build upon and improve the model, driving further progress in the field.


📅 Published on Jun 9

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.11289
• PDF: https://arxiv.org/pdf/2606.11289
• Project Page: https://zlab-princeton.github.io/i1/

🤖 Models citing this paper:
https://huggingface.co/zlab-princeton/i1-3B

📊 Datasets citing this paper:
https://huggingface.co/datasets/zlab-princeton/i1-captions
https://huggingface.co/datasets/zlab-princeton/i1-gptedit-tfrecord

🚀 Spaces citing this paper:
https://huggingface.co/spaces/multimodalart/i1-3B

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#TextToImageModels #DiffusionModels #TextEncoderAdapters #ImageSynthesis #DeepLearningModels