AI & ML Papers
Advancing research in Machine Learning – practical insights, tools, and techniques for researchers.

Admin: @HusseinSheikho || @Hussein_Sheikho
UI2Code^N: A Visual Language Model for Test-Time Scalable Interactive UI-to-Code Generation

📝 Summary:
UI2Code^N is a visual language model trained for interactive UI-to-code generation, editing, and polishing. It uses multi-turn feedback to achieve state-of-the-art performance among open-source models, comparable to leading closed-source solutions.

🔹 Publication Date: Nov 11

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2511.08195
• PDF: https://arxiv.org/pdf/2511.08195
• Project Page: https://zheny2751-dotcom.github.io/ui2code-n.github.io/
• Github: https://zheny2751-dotcom.github.io/ui2code-n.github.io/

🔹 Models citing this paper:
https://huggingface.co/zai-org/UI2Code_N

🔹 Spaces citing this paper:
https://huggingface.co/spaces/zai-org/UI2Code_N-demo-case

==================================

For more data science resources:
https://xn--r1a.website/DataScienceT

#UI2Code #VisualLanguageModels #CodeGeneration #AI #SoftwareEngineering
VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

📝 Summary:
VisGym introduces 17 environments to evaluate VLM performance in multi-step visual interactions. Current models struggle, especially with long contexts and visual symbolic tasks. Explicit goals and demonstrations offer pathways for improvement.

🔹 Publication Date: Jan 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2601.16973
• PDF: https://arxiv.org/pdf/2601.16973
• Project Page: https://visgym.github.io/
• Github: https://visgym.github.io/

#MultimodalAI #VisualLanguageModels #AIenvironments #ComputerVision #AIResearch
🔥 S-Agent: Spatial Tool-Use Elicits Reasoning for Spatial Intelligence

💡 The paper introduces S-Agent, a spatial reasoning framework that equips visual language models for continuous 3D world understanding from multi-view imagery. It targets a limitation of existing visual language models and tool-augmented agents: they perform static, stateless inference on isolated visual observations, which is insufficient for real-world spatial intelligence.

The S-Agent method involves formulating spatial reasoning as spatio-temporal evidence accumulation, rather than isolated frame-level prediction. This is achieved by casting the visual language model as a semantic planner that decides what evidence is needed, while a hierarchy of spatial tools and experts grounds objects in 2D, lifts them into 3D geometric evidence, and aggregates this evidence into high-level spatial knowledge. The framework also includes a temporal memory mechanism, comprising scene memory and agent memory, which enables evidence integration across frames and reasoning steps.
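To make the evidence-accumulation idea concrete, here is a minimal Python sketch. It is not the paper's implementation: every name in it (SceneMemory, AgentMemory, plan_needed_objects, ground_2d, lift_to_3d, answer_distance) is hypothetical, the planner is a keyword-matching stand-in for the VLM, detections are hand-written, the 2D-to-3D lift is ordinary pinhole back-projection, and all frames are assumed to share one camera coordinate system.

```python
import math
from dataclasses import dataclass, field

Point3D = tuple[float, float, float]


@dataclass
class SceneMemory:
    """Hypothetical scene memory: 3D points per object, accumulated across frames."""
    points: dict[str, list[Point3D]] = field(default_factory=dict)

    def add(self, label: str, xyz: Point3D) -> None:
        self.points.setdefault(label, []).append(xyz)

    def centroid(self, label: str) -> Point3D:
        pts = self.points[label]
        n = len(pts)
        return (sum(p[0] for p in pts) / n,
                sum(p[1] for p in pts) / n,
                sum(p[2] for p in pts) / n)


@dataclass
class AgentMemory:
    """Hypothetical agent memory: a log of the reasoning/tool steps taken so far."""
    steps: list[str] = field(default_factory=list)


def plan_needed_objects(question: str, known_labels: set[str]) -> list[str]:
    # Stand-in for the VLM planner: pick the labels mentioned in the question.
    return [label for label in sorted(known_labels) if label in question.lower()]


def ground_2d(frame: dict, label: str):
    # Stand-in for a 2D grounding tool: returns (u, v, depth) or None.
    return frame["detections"].get(label)


def lift_to_3d(u: float, v: float, depth: float,
               fx: float, fy: float, cx: float, cy: float) -> Point3D:
    # Pinhole back-projection of a pixel with metric depth into camera coordinates.
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def answer_distance(question: str, frames: list[dict], intrinsics: dict):
    """Accumulate 3D evidence over all frames, then aggregate it into a distance."""
    scene, agent = SceneMemory(), AgentMemory()
    labels = set().union(*(frame["detections"] for frame in frames))
    needed = plan_needed_objects(question, labels)
    for index, frame in enumerate(frames):
        for label in needed:
            detection = ground_2d(frame, label)
            if detection is None:
                continue  # object not visible in this frame
            scene.add(label, lift_to_3d(*detection, **intrinsics))
            agent.steps.append(f"frame {index}: lifted {label}")
    first, second = needed[:2]  # this sketch assumes exactly two objects are asked about
    return math.dist(scene.centroid(first), scene.centroid(second)), agent


frames = [
    {"detections": {"chair": (320.0, 240.0, 2.0), "table": (820.0, 240.0, 2.0)}},
    {"detections": {"chair": (320.0, 240.0, 2.0), "table": (820.0, 240.0, 2.0)}},
]
camera = {"fx": 500.0, "fy": 500.0, "cx": 320.0, "cy": 240.0}
distance, agent = answer_distance("How far is the chair from the table?", frames, camera)
print(distance, agent.steps)
```

The point of the sketch is the control flow: the planner decides which objects matter, tools turn each frame into 3D evidence, and the answer is computed from the accumulated memory rather than from any single frame.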

S-Agent consistently improves both open-source and closed-source visual language models in a training-free manner. Supervised fine-tuning on S-Agent-generated spatial trajectories yields S-Agent-8B, a compact spatial agent that significantly surpasses similar-scale baselines and performs comparably to advanced closed-source models. These results come from experiments on multi-view and video spatial reasoning benchmarks. The paper's main contribution is a spatial tool-use agentic paradigm for understanding and reasoning over continuous multi-view images and videos, with potential applications in real-world spatial intelligence.


📅 Published on Jun 18

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.20515
• PDF: https://arxiv.org/pdf/2606.20515
• Project Page: https://ropedia.github.io/S-Agent

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#SpatialReasoning #VisualLanguageModels #3DWorldUnderstanding #SpatioTemporalEvidence #ToolUseInAI