AI & ML Papers

🔥 SmolDocling: An ultra-compact vision-language model for end-to-end multi-modal document conversion

💡 The paper introduces SmolDocling, a compact vision-language model designed for end-to-end document conversion. The model aims to process entire pages and generate a new universal markup format called DocTags, which captures all page elements in their full context with location. Unlike existing approaches that rely on large foundational models or ensemble solutions, SmolDocling offers a single end-to-end conversion model with 256M parameters. This approach allows for accurately capturing content, structure, and spatial location of document elements.

The model is trained to reproduce document features such as code listings, tables, equations, charts, lists, and more across a diverse range of document types, including business documents, academic papers, technical reports, patents, and forms. The authors also contribute novel publicly sourced datasets for charts, tables, equations, and code recognition.

Experimental results show that SmolDocling performs competitively with other vision-language models up to 27 times larger, while substantially reducing computational requirements. The authors plan to make the model and datasets publicly available, which should facilitate further research in this area. Overall, SmolDocling offers an efficient single-model solution for end-to-end document conversion, with potential applications across industries and domains.

arXiv.org

SmolDocling: An ultra-compact vision-language model for end-to-end...

We introduce SmolDocling, an ultra-compact vision-language model targeting end-to-end document conversion. Our model comprehensively processes entire pages by generating DocTags, a new universal...

✨ Join Best TG Channels

👋 Join Our WhatsApp Channel

📝 Contact / Collaborate

🔥 OpenDevin: An Open Platform for AI Software Developers as Generalist Agents

💡 The paper introduces OpenDevin (since renamed OpenHands), a platform for developing AI agents that interact with the world the way human developers do: by writing code, using a command line, and browsing the web. The platform supports multiple agents and includes evaluation benchmarks.

The method used to develop OpenDevin involves creating a platform that can support the implementation of new agents, safe interaction with sandboxed environments for code execution, and coordination between multiple agents. The platform also incorporates evaluation benchmarks to assess the performance of the agents. The development of OpenDevin is a community effort, with contributions from over 160 contributors, and is released under the permissive MIT license.

The results of the paper include the evaluation of agents over 15 challenging tasks, including software engineering and web browsing. The evaluation benchmarks demonstrate the flexibility and power of the OpenDevin platform in supporting AI agents that can interact with the world in complex ways. The paper also highlights the potential of OpenDevin to improve over time, with ongoing contributions from the community. Overall, the paper contributes to the development of AI agents that can interact with the world in ways similar to human developers, with potential applications in a wide range of areas, including software engineering and web development.

📅 Published on Jul 23, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2407.16741
• PDF: https://arxiv.org/pdf/2407.16741
• GitHub: https://github.com/opendevin/opendevin ⭐ 72.6k

📊 Datasets citing this paper:
• https://huggingface.co/datasets/GloriaaaM/LLM-Agent-Harness-Survey
• https://huggingface.co/datasets/namanvats/harbor-goose-openhands-benchmark
• https://huggingface.co/datasets/antieval/aware-bench-trajectories

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ArtificialIntelligenceAgents #AIPlatformDevelopment #GeneralistAgents #HumanCentricAI #AIEnvironmentInteraction

arXiv.org

OpenHands: An Open Platform for AI Software Developers as Generalist Agents

Software is one of the most powerful tools that we humans have at our disposal; it allows a skilled programmer to interact with the world in complex and profound ways. At the same time, thanks to...

🔥 PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B Ultra-Compact Vision-Language Model

💡 The paper proposes PaddleOCR-VL, a state-of-the-art and resource-efficient model for document parsing. It addresses the need to accurately recognize document elements such as text, tables, formulas, and charts while keeping resource consumption low. The core component, PaddleOCR-VL-0.9B, is a compact vision-language model that combines a NaViT-style dynamic-resolution visual encoder with the ERNIE language model. It supports 109 languages and recognizes complex elements with high accuracy. The results show that PaddleOCR-VL achieves state-of-the-art performance in both page-level document parsing and element-level recognition, outperforming existing solutions and remaining competitive with top-tier vision-language models. The model also delivers fast inference speeds, which makes it well suited to practical deployment. The code is publicly available.

📅 Published on Oct 16, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2510.14528
• PDF: https://arxiv.org/pdf/2510.14528
• GitHub: https://github.com/PaddlePaddle/PaddleOCR ⭐ 77.1k

🤖 Models citing this paper:
• https://huggingface.co/PaddlePaddle/PaddleOCR-VL
• https://huggingface.co/PaddlePaddle/PP-DocLayoutV2
• https://huggingface.co/unsloth/PaddleOCR-VL

📊 Datasets citing this paper:
• https://huggingface.co/datasets/proxectonos/corpus_dominio_cientifico

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/PaddlePaddle/PaddleOCR-VL_Online_Demo
• https://huggingface.co/spaces/eduagarcia/multilingual-tokenizer-leaderboard
• https://huggingface.co/spaces/waytoAGI/PaddleOCR-VL_Online_Demo

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultilingualDocumentParsing #VisionLanguageModels #DocumentAnalysis #TableRecognition #MultimodalLearning

arXiv.org

PaddleOCR-VL: Boosting Multilingual Document Parsing via a 0.9B...

In this report, we propose PaddleOCR-VL, a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model...

🔥 LTX-2: Efficient Joint Audio-Visual Foundation Model

💡 The paper introduces LTX-2, an open-source audiovisual diffusion model that generates synchronized video and audio content. The problem addressed is that current text-to-video diffusion models can generate compelling video sequences but lack the semantic, emotional, and atmospheric cues that audio provides. To solve this, the authors propose a dual-stream transformer architecture with cross-modal attention and classifier-free guidance. The model consists of a 14 billion parameter video stream and a 5 billion parameter audio stream, coupled through bidirectional audio-video cross-attention layers. This architecture enables efficient training and inference of a unified audiovisual model, with more capacity allocated for video generation than audio generation.

The method used to achieve this includes employing a multilingual text encoder for broader prompt understanding and introducing a modality-aware classifier-free guidance mechanism for improved audiovisual alignment and controllability. The model is trained using a combination of video and audio data, with temporal positional embeddings and cross-modality AdaLN for shared timestep conditioning.
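
The modality-aware guidance described above can be pictured with the standard classifier-free guidance update, applied with a separate scale for each stream. This is a minimal sketch and not the paper's formulation: the function names, the per-modality scales, and the toy numbers are all assumptions made for illustration.

```python
from typing import List, Tuple


def cfg_combine(uncond: List[float], cond: List[float], scale: float) -> List[float]:
    """Standard classifier-free guidance: uncond + scale * (cond - uncond)."""
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


def modality_aware_cfg(
    video_uncond: List[float],
    video_cond: List[float],
    audio_uncond: List[float],
    audio_cond: List[float],
    video_scale: float,
    audio_scale: float,
) -> Tuple[List[float], List[float]]:
    """Guide each stream with its own scale (hypothetical scheme)."""
    video = cfg_combine(video_uncond, video_cond, video_scale)
    audio = cfg_combine(audio_uncond, audio_cond, audio_scale)
    return video, audio


# Toy predictions; real inputs would be the model's denoising outputs.
video, audio = modality_aware_cfg(
    video_uncond=[0.0, 1.0], video_cond=[1.0, 3.0],
    audio_uncond=[2.0], audio_cond=[4.0],
    video_scale=2.0, audio_scale=0.5,
)
print(video, audio)
```

With a scale of 1.0 the update returns the conditional prediction unchanged, so separate scales let one stream follow the prompt more strongly than the other.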

The results show that LTX-2 achieves state-of-the-art audiovisual quality and prompt adherence among open-source systems, while delivering results comparable to proprietary models at a fraction of their computational cost and inference time. The model is capable of generating high-quality, temporally synchronized audiovisual content, including rich and coherent audio tracks that follow the characters, environment, style, and emotion of each scene. The model weights and code are publicly released, making it accessible for further research and development. Overall, LTX-2 provides a significant contribution to the field of audiovisual generation, enabling the creation of more realistic and engaging multimedia content.

📅 Published on Jan 6

🔗 Links:
• arXiv: https://arxiv.org/abs/2601.03233
• PDF: https://arxiv.org/pdf/2601.03233
• Project Page: https://app.ltx.studio/ltx-2-playground/i2v
• GitHub: https://github.com/Lightricks/LTX-2 ⭐ 6.4k

🤖 Models citing this paper:
• https://huggingface.co/Lightricks/LTX-2
• https://huggingface.co/Lightricks/LTX-2.3
• https://huggingface.co/unsloth/LTX-2.3-GGUF

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/linoyts/LTX-2-3-First-Last-Frame
• https://huggingface.co/spaces/linoyts/LTX-2-3-sync
• https://huggingface.co/spaces/linoyts/LTX-2-3-outpaint

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AudioVisualLearning #MultimodalDiffusionModels #CrossModalAttention #AudioVideoGeneration #JointFoundationModels

arXiv.org

LTX-2: Efficient Joint Audio-Visual Foundation Model

Recent text-to-video diffusion models can generate compelling video sequences, yet they remain silent -- missing the semantic, emotional, and atmospheric cues that audio provides. We introduce...

🔥 LightRAG: Simple and Fast Retrieval-Augmented Generation

💡 The paper introduces LightRAG, a novel approach to improve Retrieval-Augmented Generation systems, which enhance large language models by integrating external knowledge sources. Existing systems have limitations, including reliance on flat data representations and inadequate contextual awareness, leading to fragmented answers that fail to capture complex inter-dependencies. To address these challenges, LightRAG incorporates graph structures into text indexing and retrieval processes, employing a dual-level retrieval system that enhances comprehensive information retrieval from both low-level and high-level knowledge discovery. The integration of graph structures with vector representations facilitates efficient retrieval of related entities and their relationships, significantly improving response times while maintaining contextual relevance. An incremental update algorithm ensures the timely integration of new data, allowing the system to remain effective and responsive in rapidly changing data environments. The experimental results demonstrate considerable improvements in retrieval accuracy and efficiency compared to existing approaches, making LightRAG a significant contribution to the field of Retrieval-Augmented Generation. The authors have made LightRAG open-source, making it available for further development and application. Overall, LightRAG provides a simple and fast retrieval-augmented generation approach that achieves better accuracy and response times, making it a valuable tool for data science applications.
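
The dual-level retrieval idea can be illustrated on a toy graph. The sketch below is not LightRAG's implementation: the `Relation` type, the `theme` field, the sample edges, and the exact-match lookup (standing in for vector search) are assumptions chosen to keep the example small.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass(frozen=True)
class Relation:
    source: str
    target: str
    theme: str  # hypothetical high-level label attached to the edge


# Hypothetical toy knowledge graph; a real index would be built from documents.
RELATIONS: List[Relation] = [
    Relation("bees", "pollination", "agriculture"),
    Relation("pesticides", "bees", "environmental risk"),
    Relation("pollination", "crop yield", "agriculture"),
]


def low_level(query_terms: Set[str]) -> List[Relation]:
    """Entity-centric lookup: edges touching an entity named in the query."""
    return [r for r in RELATIONS if r.source in query_terms or r.target in query_terms]


def high_level(query_terms: Set[str]) -> List[Relation]:
    """Theme-centric lookup: edges whose theme is named in the query."""
    return [r for r in RELATIONS if r.theme in query_terms]


def dual_level(query_terms: Set[str]) -> List[Relation]:
    """Merge both result sets, keeping first-seen order without duplicates."""
    merged: List[Relation] = []
    for r in low_level(query_terms) + high_level(query_terms):
        if r not in merged:
            merged.append(r)
    return merged


def insert_relation(relation: Relation) -> None:
    """Incremental update: append a new edge without rebuilding the graph."""
    if relation not in RELATIONS:
        RELATIONS.append(relation)
```

Calling `dual_level({"bees", "agriculture"})` combines the edges around the entity "bees" with the edges tagged "agriculture", which is the kind of specific-plus-thematic coverage the summary describes.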

📅 Published on Oct 8, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2410.05779
• PDF: https://arxiv.org/pdf/2410.05779
• GitHub: https://github.com/hkuds/lightrag ⭐ 34.7k
• Project Page: https://huggingface.co/Neha12210/project2-advanced-rag

🤖 Models citing this paper:
• https://huggingface.co/muthuk1/graphrag-inference-hackathon
• https://huggingface.co/atad-tokyo/GST_LIVING_NOVEL
• https://huggingface.co/Neha12210/project2-advanced-rag

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/rm-lht/lightrag

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#RetrievalAugmentedGeneration #GraphBasedInformationRetrieval #KnowledgeDiscoverySystems #LargeLanguageModels #TextIndexingTechniques

arXiv.org

LightRAG: Simple and Fast Retrieval-Augmented Generation

Retrieval-Augmented Generation (RAG) systems enhance large language models (LLMs) by integrating external knowledge sources, enabling more accurate and contextually relevant responses tailored to...

🔥 RF-DETR: Neural Architecture Search for Real-Time Detection Transformers

💡 The paper introduces RF-DETR, a lightweight detection transformer that uses neural architecture search to optimize accuracy and latency for real-time object detection. The motivation is that open-vocabulary detectors perform well on COCO but often fail to generalize to real-world datasets with classes not seen during pre-training. Instead of fine-tuning a large vision-language model, the authors fine-tune a pre-trained base network on a target dataset and evaluate thousands of network configurations to find the best accuracy-latency tradeoff. The approach uses weight-sharing neural architecture search to improve transferability to diverse target domains. The results show that RF-DETR significantly outperforms prior state-of-the-art real-time methods on COCO and Roboflow100-VL. Specifically, RF-DETR (nano) achieves 48.0 AP on COCO, beating D-FINE (nano) by 5.3 AP at similar latency, and RF-DETR (2x-large) outperforms GroundingDINO (tiny) on Roboflow100-VL while running 20 times as fast. According to the authors, RF-DETR (2x-large) is also the first real-time detector to surpass 60 AP on COCO. The code is publicly available.
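
Choosing the best accuracy-latency tradeoff among many evaluated configurations amounts to finding a Pareto front. The sketch below shows only that selection step, under the assumption that each configuration has already been measured. The `Config` type, the helper names, and the sample numbers are hypothetical and do not come from the paper.

```python
from typing import List, NamedTuple, Optional


class Config(NamedTuple):
    name: str
    latency_ms: float
    ap: float


def pareto_front(configs: List[Config]) -> List[Config]:
    """Keep configs that are not matched or beaten on both latency and AP."""
    front: List[Config] = []
    best_ap = float("-inf")
    # Walk from fastest to slowest; on equal latency, visit higher AP first.
    for cfg in sorted(configs, key=lambda c: (c.latency_ms, -c.ap)):
        if cfg.ap > best_ap:
            front.append(cfg)
            best_ap = cfg.ap
    return front


def best_under_budget(configs: List[Config], max_latency_ms: float) -> Optional[Config]:
    """Pick the most accurate Pareto-optimal config within a latency budget."""
    feasible = [c for c in pareto_front(configs) if c.latency_ms <= max_latency_ms]
    return max(feasible, key=lambda c: c.ap) if feasible else None


# Hypothetical measurements, for illustration only.
candidates = [
    Config("a", 2.0, 45.0),
    Config("b", 3.0, 44.0),
    Config("c", 4.0, 50.0),
    Config("d", 4.0, 48.0),
    Config("e", 6.0, 52.0),
]
print([c.name for c in pareto_front(candidates)])
```

A slower configuration stays on the front only if it is strictly more accurate than every faster one, so a deployment can pick the point that fits its latency budget.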

📅 Published on Nov 12, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2511.09554
• PDF: https://arxiv.org/pdf/2511.09554
• Project Page: https://rfdetr.roboflow.com/1.3.0/
• GitHub: https://github.com/roboflow/rf-detr ⭐ 6.9k

🤖 Models citing this paper:
• https://huggingface.co/stevenbucaille/rf-detr-small
• https://huggingface.co/stevenbucaille/rf-detr-nano
• https://huggingface.co/stevenbucaille/rf-detr-base

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/arihant3704/rf-detr-playground

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#RealTimeObjectDetection #NeuralArchitectureSearch #DetectionTransformers #WeightSharingNAS #EfficientComputerVision

arXiv.org

RF-DETR: Neural Architecture Search for Real-Time Detection Transformers

Open-vocabulary detectors achieve impressive performance on COCO, but often fail to generalize to real-world datasets with out-of-distribution classes not typically found in their pre-training....

🔥 Agent READMEs: An Empirical Study of Context Files for Agentic Coding

💡 This paper presents the first large-scale empirical study of agent context files, also known as agent READMEs, which provide persistent, project-level instructions for agentic coding tools. The study analyzed 2,303 agent context files from 1,925 repositories to characterize their structure, maintenance, and content. The researchers found that these files are not static documentation but complex, difficult-to-read artifacts that evolve like configuration code, maintained through frequent small additions.

The content analysis of 16 instruction types revealed that developers prioritize functional context, such as build and run commands, implementation details, and architecture. However, the study also identified a significant gap, as non-functional requirements like security and performance are rarely specified. The findings indicate that while developers use context files to make agents functional, they provide few guardrails to ensure that agent-written code is secure or performant.

The study highlights the need for improved tooling and practices to address this gap. Its contributions are a detailed picture of the structure and content of agent context files and the identification of areas where they fall short, with implications for both the design of agentic coding tools and the way projects write their context files.

📅 Published on Nov 17, 2025

🔗 Links:
• arXiv: https://arxiv.org/abs/2511.12884
• PDF: https://arxiv.org/pdf/2511.12884
• Project Page: https://agents.md
• GitHub: https://github.com/openai/agents.md ⭐ 21.0k

📊 Datasets citing this paper:
• https://huggingface.co/datasets/hao-li/AIDev
• https://huggingface.co/datasets/farida5gaber/AIDev
• https://huggingface.co/datasets/dysavepeople/AIDev

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#AgenticCodingTools #AgentContextFiles #EmpiricalSoftwareEngineering #AgenticREADMEs #SoftwareDevelopmentArtifacts

arXiv.org

Agent READMEs: An Empirical Study of Context Files for Agentic Coding

Agentic coding tools receive goals written in natural language as input, break them down into specific tasks, and write or execute the actual code with minimal human intervention. Central to this...

🔥 OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

💡 The paper introduces OmniFlatten, a novel end-to-end GPT model that enables real-time natural full-duplex spoken dialogue. The goal is to achieve low latency and natural interactions in full-duplex dialogue systems, which is a significant challenge due to human conversation dynamics such as interruptions, backchannels, and overlapping speech. To address this, the authors propose a multi-stage post-training technique that integrates speech and text without altering the original model's architecture. The training process consists of three stages: modality alignment, half-duplex dialogue learning, and full-duplex dialogue learning. A flattening operation is used to standardize the data, allowing for unified training methods and model architecture across different modalities and tasks. The OmniFlatten model can generate text and speech in real-time, effectively modeling complex behaviors inherent to natural conversations. The approach offers a straightforward modeling technique and a promising research direction for developing efficient and natural end-to-end full-duplex spoken dialogue systems. The results are demonstrated through audio samples of dialogues generated by OmniFlatten, which can be found online. Overall, the paper contributes to the development of full-duplex spoken dialogue systems that can mimic human-human interactions, with potential applications in various areas such as virtual assistants, customer service, and more.
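
One way to picture a flattening operation is to cut the text and speech token streams into chunks and interleave them into a single sequence that an unmodified GPT-style model can consume. The sketch below assumes that reading. The chunk sizes, function names, and token labels are hypothetical and are not taken from the paper.

```python
from typing import List


def chunk(tokens: List[str], size: int) -> List[List[str]]:
    """Split a token list into consecutive chunks of at most `size` items."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def flatten(
    text_tokens: List[str],
    speech_tokens: List[str],
    text_chunk: int = 2,
    speech_chunk: int = 3,
) -> List[str]:
    """Interleave text and speech chunks into one flat sequence."""
    text_chunks = chunk(text_tokens, text_chunk)
    speech_chunks = chunk(speech_tokens, speech_chunk)
    flat: List[str] = []
    for i in range(max(len(text_chunks), len(speech_chunks))):
        if i < len(text_chunks):
            flat.extend(text_chunks[i])
        if i < len(speech_chunks):
            flat.extend(speech_chunks[i])
    return flat


print(flatten(["t1", "t2", "t3"], ["s1", "s2", "s3", "s4", "s5"]))
```

Because the result is an ordinary token sequence, the same next-token training objective can be reused across modalities and tasks, which matches the summary's point about unified training.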

📅 Published on Oct 23, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2410.17799
• PDF: https://arxiv.org/pdf/2410.17799
• GitHub: https://github.com/karpathy/nanogpt ⭐ 57.6k

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#GPTModelArchitecture #FullDuplexDialogueSystems #NaturalLanguageProcessing #SpeechRecognitionTechniques #EndToEndConversationalAI

arXiv.org

OmniFlatten: An End-to-end GPT Model for Seamless Voice Conversation

Full-duplex spoken dialogue systems significantly surpass traditional turn-based dialogue systems, as they allow simultaneous bidirectional communication, closely mirroring human-human...

🔥 DeepSeek-V3 Technical Report

💡 DeepSeek-V3 is a Mixture-of-Experts language model with 671 billion total parameters, of which only 37 billion are activated for each token, so per-token compute is far lower than the total parameter count suggests. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts the Multi-head Latent Attention and DeepSeekMoE architectures, both validated in its predecessor, DeepSeek-V2.

The model is trained on 14.8 trillion diverse and high-quality tokens, followed by supervised fine-tuning and reinforcement learning stages to fully harness its capabilities. The training process is stable and requires only 2.788 million H800 GPU hours for full training, which is relatively low compared to other models.
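
The figures quoted above are easier to compare after a little arithmetic. The snippet below uses only the numbers in this summary. The variable names are illustrative, and the throughput figure is rough because the reported GPU hours cover full training rather than pre-training alone.

```python
# Figures as stated in the summary above.
TOTAL_PARAMS = 671e9    # total parameters
ACTIVE_PARAMS = 37e9    # parameters activated per token
TRAIN_TOKENS = 14.8e12  # pre-training tokens
GPU_HOURS = 2.788e6     # H800 GPU hours for full training

# Share of the model that participates in each token's forward pass.
active_fraction = ACTIVE_PARAMS / TOTAL_PARAMS

# Rough average throughput across the whole training run.
tokens_per_gpu_hour = TRAIN_TOKENS / GPU_HOURS

print(f"active fraction: {active_fraction:.1%}")
print(f"tokens per GPU hour: {tokens_per_gpu_hour:,.0f}")
```

Roughly 5.5% of the parameters are active for any given token, which is where the efficiency claim comes from.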

The results show that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. The model also pioneers an auxiliary-loss-free strategy for load balancing and uses a multi-token prediction training objective for stronger performance. The model checkpoints are available for further research and development.

Overall, DeepSeek-V3 contributes a strong open language model whose training was stable and comparatively inexpensive for its scale. That combination makes it an attractive base for researchers and developers who want high performance without the training costs usually associated with models of this size.

📅 Published on Dec 27, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2412.19437
• PDF: https://arxiv.org/pdf/2412.19437
• GitHub: https://github.com/deepseek-ai/deepseek-v3 ⭐ 103.4k

🤖 Models citing this paper:
• https://huggingface.co/deepseek-ai/DeepSeek-V3
• https://huggingface.co/deepseek-ai/DeepSeek-V3-0324
• https://huggingface.co/deepseek-ai/DeepSeek-V3-Base

📊 Datasets citing this paper:
• https://huggingface.co/datasets/alpha-one-index/awesome-ai-index
• https://huggingface.co/datasets/jeffliulab/visinject
• https://huggingface.co/datasets/AcroYAMALEX/acro-yamalex-llmjp-4-math-cot

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/nanotron/ultrascale-playbook
• https://huggingface.co/spaces/Ki-Seki/ultrascale-playbook-zh-cn
• https://huggingface.co/spaces/weege007/ultrascale-playbook

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MixtureOfExpertsArchitecture #DeepLearningModels #ParameterEfficientTraining #LatentAttentionMechanisms #EfficientLanguageModeling

arXiv.org

DeepSeek-V3 Technical Report

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training,...

🔥 LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

💡 The paper introduces LlamaFactory, a unified framework for efficient fine-tuning of large language models across various tasks. Fine-tuning these models normally requires significant effort and coding expertise, which is a barrier for many users. LlamaFactory integrates a suite of cutting-edge efficient training methods and lets users customize the fine-tuning of over 100 language models without writing code, through a web-based user interface called LlamaBoard. The authors validate the efficiency and effectiveness of LlamaFactory on language modeling and text generation tasks. The framework has been released publicly and, at the time the paper was written, had already received over 13,000 stars and 1,600 forks on GitHub. Overall, LlamaFactory makes it easier for researchers and practitioners to adapt large language models to specific tasks and applications.

📅 Published on Mar 20, 2024

🔗 Links:
• arXiv: https://arxiv.org/abs/2403.13372
• PDF: https://arxiv.org/pdf/2403.13372
• Project Page: https://huggingface.co/spaces/hiyouga/LLaMA-Board
• GitHub: https://github.com/hiyouga/LLaMA-Factory ⭐ 70.9k

🤖 Models citing this paper:
• https://huggingface.co/AELLM/Llama-3.2-Chibi-3B
• https://huggingface.co/GXMZU/Qwen3-14B-ai-expert
• https://huggingface.co/Xin-Rui/LLAMA-Fac-NEW-A800

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/hiyouga/LLaMA-Board
• https://huggingface.co/spaces/Justinrune/LLaMA-Factory
• https://huggingface.co/spaces/Darok/Featherless-Feud

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#EfficientFineTuning #LanguageModelOptimization #UnifiedTrainingFrameworks #LargeLanguageModelDevelopment #AutomatedModelCustomization

arXiv.org

LlamaFactory: Unified Efficient Fine-Tuning of 100+ Language Models

Efficient fine-tuning is vital for adapting large language models (LLMs) to downstream tasks. However, it requires non-trivial efforts to implement these methods on different models. We present...

🔥 LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

💡 The paper introduces LeWorldModel, a stable end-to-end joint-embedding predictive architecture that trains efficiently from raw pixels. Existing methods for learning world models in compact latent spaces are fragile and rely on complex loss terms, pre-trained encoders, or auxiliary supervision to avoid representation collapse. LeWorldModel addresses this by using only two loss terms, a next-embedding prediction loss and a regularizer, to train the model end-to-end from raw pixels. This reduces the number of tunable loss hyperparameters from six to one compared with existing methods. The model has approximately 15 million parameters and can be trained on a single GPU in a few hours, making it up to 48 times faster than foundation-model-based world models. The results show that LeWorldModel remains competitive across diverse 2D and 3D control tasks and encodes meaningful physical structure in its latent space. The model can also reliably detect physically implausible events, which suggests that its learned representation is robust and generalizable. Overall, LeWorldModel provides a stable and efficient framework for learning world models from raw pixels.
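
The two-term objective can be sketched as a prediction loss plus one weighted regularizer. The summary does not say which regularizer the paper uses, so the variance-based anti-collapse term below is a stand-in chosen for simplicity, and every function name and value is an assumption.

```python
from typing import List

Vector = List[float]


def prediction_loss(predicted: List[Vector], target: List[Vector]) -> float:
    """Mean squared error between predicted and actual next embeddings."""
    total = sum(
        (p - t) ** 2
        for pv, tv in zip(predicted, target)
        for p, t in zip(pv, tv)
    )
    count = sum(len(v) for v in target)
    return total / count


def variance_regularizer(embeddings: List[Vector], min_std: float = 1.0) -> float:
    """Stand-in anti-collapse term: penalize dimensions whose batch std is below min_std."""
    n, dim = len(embeddings), len(embeddings[0])
    penalty = 0.0
    for d in range(dim):
        column = [e[d] for e in embeddings]
        mean = sum(column) / n
        std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
        penalty += max(0.0, min_std - std)
    return penalty / dim


def total_loss(
    predicted: List[Vector],
    target: List[Vector],
    embeddings: List[Vector],
    reg_weight: float,
) -> float:
    """Two-term objective with a single tunable weight."""
    return prediction_loss(predicted, target) + reg_weight * variance_regularizer(embeddings)
```

A batch in which every embedding is identical receives the maximum penalty, while a well-spread batch receives none, so `reg_weight` is the only loss hyperparameter left to tune in this sketch.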

📅 Published on Mar 13

🔗 Links:
• arXiv: https://arxiv.org/abs/2603.19312
• PDF: https://arxiv.org/pdf/2603.19312
• Project Page: https://le-wm.github.io/
• GitHub: https://github.com/lucas-maes/le-wm ⭐ 3.1k

🤖 Models citing this paper:
• https://huggingface.co/quentinll/lewm-pusht
• https://huggingface.co/aguennoune17/atlas-v2-nwm-fp8-compressed
• https://huggingface.co/quentinll/lewm-tworooms

📊 Datasets citing this paper:
• https://huggingface.co/datasets/quentinll/lewm-pusht
• https://huggingface.co/datasets/quentinll/lewm-cube
• https://huggingface.co/datasets/quentinll/lewm-reacher

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#WorldModels #JointEmbedding #PredictiveArchitectures #EndToEndLearning #LatentSpaceRepresentation

arXiv.org

LeWorldModel: Stable End-to-End Joint-Embedding Predictive...

Joint Embedding Predictive Architectures (JEPAs) offer a compelling framework for learning world models in compact latent spaces, yet existing methods remain fragile, relying on complex multi-term...
