AI & ML Papers

🔥 Lens: Rethinking Training Efficiency for Foundational Text-to-Image Models

💡 The paper introduces Lens, a compact 3.8-billion-parameter text-to-image model that reaches strong performance with reduced training compute. It targets the high computational cost of training large text-to-image models, which can be a significant barrier to their adoption. The authors propose two key strategies. First, they maximize the information density of each training batch by using a dataset of 800 million densely captioned image-text pairs, where captions average approximately 109 words and so provide richer semantic supervision than conventional short captions. They also build each batch from images with multiple resolutions and diverse aspect ratios, which enlarges the effective visual coverage of each optimization step.
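
The summary does not say how Lens assembles its mixed-shape batches. One common way to handle multiple resolutions and aspect ratios is aspect-ratio bucketing, sketched below as an illustration only. The bucket shapes and the helper names (`nearest_bucket`, `group_by_bucket`, `mixed_batch`) are assumptions, not taken from the paper.

```python
from collections import defaultdict

# Hypothetical bucket shapes as (width, height); not taken from the paper.
BUCKETS = [(1024, 1024), (1152, 896), (896, 1152), (1344, 768), (768, 1344)]


def nearest_bucket(width, height, buckets=BUCKETS):
    """Return the bucket whose aspect ratio is closest to the image's."""
    ratio = width / height
    return min(buckets, key=lambda b: abs(b[0] / b[1] - ratio))


def group_by_bucket(image_sizes):
    """Map each bucket to the indices of the images assigned to it."""
    groups = defaultdict(list)
    for idx, (w, h) in enumerate(image_sizes):
        groups[nearest_bucket(w, h)].append(idx)
    return dict(groups)


def mixed_batch(groups, per_bucket):
    """Take up to `per_bucket` indices from every bucket, so a single
    optimization step sees several shapes instead of just one."""
    return {bucket: idxs[:per_bucket] for bucket, idxs in groups.items()}


sizes = [(2000, 2000), (1920, 1080), (512, 512)]
groups = group_by_bucket(sizes)
batch = mixed_batch(groups, per_bucket=1)
```

Images inside one bucket can be resized to the same shape and stacked into a tensor, while the step as a whole still covers several aspect ratios.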

Second, they improve convergence speed through careful architectural choices, including adopting a semantic VAE that provides better latent representations and employing a strong language encoder that accelerates optimization while enabling multilingual generalization from English-only training data. The authors also apply reinforcement learning with taxonomy-driven prompts and structured reward rubrics to suppress artifacts and improve visual quality, and use a reasoner module with training-free system prompt search to better align user requests with the model.

The results show that Lens achieves performance competitive with, and in several cases surpassing, state-of-the-art models with more than 6 billion parameters, while requiring significantly less training compute. For example, Lens requires only about 19.3% of the training compute used by Z-Image. The model generalizes to arbitrary aspect ratios and resolutions up to 1440×1440, and supports prompts in several commonly used languages. Thanks to its compact size, Lens generates a 1024×1024 image in 3.15 seconds on a single NVIDIA H100 GPU, while its distilled turbo version performs 4-step generation in 0.84 seconds. Overall, the paper demonstrates that Lens is an efficient and effective text-to-image model that can be trained with far fewer computational resources than existing models.
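
As a quick sanity check on the reported numbers, the ratios they imply can be worked out directly. The inputs are the figures quoted in the summary. The derived values are simple arithmetic, and the per-step figure assumes all of the turbo latency is spent in the 4 sampling steps, which ignores text encoding and VAE decoding.

```python
# Figures quoted in the summary.
compute_fraction = 0.193   # Lens training compute relative to Z-Image
base_latency_s = 3.15      # one 1024x1024 image on a single H100
turbo_latency_s = 0.84     # distilled turbo version
turbo_steps = 4

# Derived ratios (plain arithmetic, not results reported by the paper).
compute_reduction = 1 / compute_fraction          # about 5.18x less compute
turbo_speedup = base_latency_s / turbo_latency_s  # about 3.75x faster
turbo_per_step_s = turbo_latency_s / turbo_steps  # about 0.21 s, upper bound

print(f"compute reduction: {compute_reduction:.2f}x")
print(f"turbo speedup: {turbo_speedup:.2f}x")
print(f"turbo time per step: {turbo_per_step_s:.2f} s")
```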

📅 Published on May 20

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.21573
• PDF: https://arxiv.org/pdf/2605.21573
• Project Page: https://huggingface.co/microsoft/Lens

🤖 Models citing this paper:
• https://huggingface.co/microsoft/Lens-Turbo
• https://huggingface.co/microsoft/Lens
• https://huggingface.co/microsoft/Lens-Base

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/multimodalart/lens

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#TextToImageModels #EfficientTrainingMethods #CompactNeuralNetworks #ImageTextPairs #FoundationalModeling


🔥 i1: A Simple and Fully Open Recipe for Strong Text-to-Image Models

💡 The paper presents a comprehensive study of text-to-image diffusion models, aiming to identify key design choices and training insights that lead to strong model performance. The problem addressed is the lack of fully open models that match the performance of state-of-the-art models, which hinders further research in the field. To tackle this, the authors conducted over 300 controlled experiments, totaling 700K TPU v6e hours, to investigate modeling and data design choices in text-to-image diffusion training and inference.

The method is a systematic investigation of design decisions, such as dataset mixing and text encoder adapters, aimed at identifying simple yet effective approaches to training strong models. The authors report several empirical findings, including that equal weighting is effective for mixing curated datasets and that larger text encoder adapters are beneficial.
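
To make the dataset-mixing finding concrete, here is a minimal sketch of one reading of "equal weighting": every curated dataset gets the same sampling probability regardless of its size, in contrast to size-proportional sampling. The dataset names, sizes, and function names are hypothetical, and the paper's actual mixing procedure may differ.

```python
import random


def equal_mix_weights(dataset_sizes):
    """Give every dataset the same sampling probability."""
    n = len(dataset_sizes)
    return {name: 1 / n for name in dataset_sizes}


def proportional_weights(dataset_sizes):
    """Baseline for comparison: sample in proportion to dataset size."""
    total = sum(dataset_sizes.values())
    return {name: size / total for name, size in dataset_sizes.items()}


def sample_sources(weights, k, seed=0):
    """Pick which dataset each of the next k training examples comes from."""
    rng = random.Random(seed)
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)


# Hypothetical curated datasets and their sizes.
sizes = {"curated_a": 250, "curated_b": 750}
equal = equal_mix_weights(sizes)
proportional = proportional_weights(sizes)
sources = sample_sources(equal, k=8)
```

Under equal weighting, a small curated set contributes as many examples per step as a large one, so its examples are revisited more often.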

The results of the study led to the development of i1, a 3B-parameter text-to-image diffusion model trained using only publicly available datasets. The i1 model is competitive with leading models on five representative benchmarks and outperforms the best existing fully open model by 29.5 absolute percentage points on average. The authors provide the i1 checkpoints, training and inference code, and the data processing pipeline, making it a fully open model that can serve as a foundation for future research in text-to-image diffusion models.

Overall, the paper contributes to the field by providing a practical foundation for open research in text-to-image diffusion models, highlighting the importance of transparency and reproducibility in AI research. The release of the i1 model and its associated code and data processing pipeline enables the research community to build upon and improve the model, driving further progress in the field.

📅 Published on Jun 9

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.11289
• PDF: https://arxiv.org/pdf/2606.11289
• Project Page: https://zlab-princeton.github.io/i1/

🤖 Models citing this paper:
• https://huggingface.co/zlab-princeton/i1-3B

📊 Datasets citing this paper:
• https://huggingface.co/datasets/zlab-princeton/i1-captions
• https://huggingface.co/datasets/zlab-princeton/i1-gptedit-tfrecord

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/multimodalart/i1-3B

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#TextToImageModels #DiffusionModels #TextEncoderAdapters #ImageSynthesis #DeepLearningModels
