AI & ML Papers

✨ReconPhys: Reconstruct Appearance and Physical Attributes from Single Video

📝 Summary:
ReconPhys is the first feedforward framework to jointly learn physical attribute estimation and 3D Gaussian Splatting reconstruction from a single video. It offers significantly faster inference and superior reconstruction quality for non-rigid objects compared to prior optimization-based methods...

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.07882
• PDF: https://arxiv.org/pdf/2604.07882
• Project Page: https://chuanshuogushi.github.io/ReconPhys/
• Github: https://chuanshuogushi.github.io/ReconPhys/

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#ComputerVision #3DReconstruction #GaussianSplatting #DeepLearning #AIResearch


✨VEFX-Bench: A Holistic Benchmark for Generic Video Editing and Visual Effects

📝 Summary:
VEFX-Bench offers a large human-annotated video editing dataset and VEFX-Reward, a specialized model for quality assessment. This benchmark allows standardized comparison, showing current models struggle with instruction following and edit locality.

🔹 Publication Date: Published on Apr 17

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.16272
• PDF: https://arxiv.org/pdf/2604.16272
• Project Page: https://xiangbogaobarry.github.io/VEFX-Bench/

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#VideoEditing #VFX #AI #ComputerVision #Benchmarks


✨NTIRE 2026 Challenge on Video Saliency Prediction: Methods and Results

📝 Summary:
This paper gives an overview of the NTIRE 2026 Challenge on Video Saliency Prediction. Participants developed methods for automatic saliency map prediction in videos, using a novel 2,000-video dataset with crowdsourced fixations. Over 20 teams submitted entries, and all challenge data is now publicly available.

🔹 Publication Date: Published on Apr 16

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.14816
• PDF: https://arxiv.org/pdf/2604.14816
• Project Page: https://www.codabench.org/competitions/12842/
• Github: https://github.com/msu-video-group/NTIRE26_Saliency_Prediction

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#VideoSaliency #ComputerVision #NTIRE #MachineLearning #SaliencyPrediction


✨Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

📝 Summary:
This paper improves compositional reasoning in vision-language models by using concreteness-based negative sample selection and a novel margin-based loss. The authors' framework, Slipform, achieves state-of-the-art accuracy on compositional benchmarks and cross-modal retrieval.

🔹 Publication Date: Published on Apr 14

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.13313
• PDF: https://arxiv.org/pdf/2604.13313

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#VisionLanguage #DeepLearning #AIResearch #ComputerVision #NLP


✨CityRAG: Stepping Into a City via Spatially-Grounded Video Generation

📝 Summary:
CityRAG generates long-term, physically grounded video sequences that maintain environmental consistency and support complex navigation through real-world geography using geo-registered data as contex...

🔹 Publication Date: Published on Apr 21

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.19741
• PDF: https://arxiv.org/pdf/2604.19741
• Project Page: https://cityrag.github.io/

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#VideoGeneration #GenerativeAI #SpatialAI #ComputerVision #UrbanSimulation


✨DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation

📝 Summary:
DeVI enables physically plausible dexterous robot control by leveraging text-conditioned synthetic videos through a hybrid tracking reward that combines 3D and 2D tracking for improved hand-object int...

🔹 Publication Date: Published on Apr 22

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.20841
• PDF: https://arxiv.org/pdf/2604.20841
• Project Page: https://snuvclab.github.io/devi/
• Github: https://github.com/snuvclab/devi

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#Robotics #AI #ComputerVision #HumanRobotInteraction #DeepLearning


✨3D-VCD: Hallucination Mitigation in 3D-LLM Embodied Agents through Visual Contrastive Decoding

📝 Summary:
3D-VCD is a new inference-time framework that reduces hallucinations in 3D embodied agents. It constructs distorted 3D scene graphs and contrasts predictions to suppress ungrounded tokens. This improves reasoning on 3D benchmarks without retraining.
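
The contrast step described above can be sketched with the generic contrastive-decoding rule, (1 + alpha) * original logits minus alpha * distorted logits. This is a minimal sketch, not the paper's implementation: the function names, the `alpha` weight, and the toy logits are assumptions for illustration, and the summary does not state which exact formula 3D-VCD uses.

```python
import math

def contrastive_logits(original, distorted, alpha=1.0):
    """Boost tokens whose score depends on the true scene; alpha is a hypothetical weight."""
    return [(1 + alpha) * o - alpha * d for o, d in zip(original, distorted)]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def argmax(values):
    """Index of the largest value."""
    return max(range(len(values)), key=values.__getitem__)

# Toy next-token logits (made-up numbers).
vocab = ["chair", "table", "lamp"]
logits_original = [2.0, 1.0, 2.1]   # conditioned on the real 3D scene graph
logits_distorted = [0.5, 1.0, 2.2]  # conditioned on a distorted scene graph

adjusted = contrastive_logits(logits_original, logits_distorted)
greedy_pick = vocab[argmax(logits_original)]
contrastive_pick = vocab[argmax(adjusted)]
probs = softmax(adjusted)
```

In this toy case "lamp" scores highly even when the scene is distorted, which suggests it comes from a language prior rather than from the scene, so the contrast demotes it and the pick moves to "chair".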

🔹 Publication Date: Published on Apr 9

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.08645
• PDF: https://arxiv.org/pdf/2604.08645
• Project Page: https://plan-lab.github.io/projects/3d-vcd

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#3DLLM #EmbodiedAI #HallucinationMitigation #ComputerVision #AIResearch



✨FlowAnchor: Stabilizing the Editing Signal for Inversion-Free Video Editing

📝 Summary:
FlowAnchor stabilizes inversion-free video editing by addressing signal instability in high-dimensional latent spaces. It uses spatial-aware attention refinement and adaptive magnitude modulation to ensure precise localization and sufficient editing strength, leading to faithful and coherent vide...

🔹 Publication Date: Published on Apr 24

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.22586
• PDF: https://arxiv.org/pdf/2604.22586

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#VideoEditing #DeepLearning #ComputerVision #GenerativeAI #AIResearch


✨Video Analysis and Generation via a Semantic Progress Function

📝 Summary:
Researchers developed a Semantic Progress Function to analyze and correct non-linear semantic evolution in generated media. This function identifies uneven pacing, enabling a linearization procedure that re-times sequences for smoother, more coherent transitions at a constant semantic rate.
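
The re-timing idea can be illustrated with a small sketch. It assumes, hypothetically, that semantic progress is the normalized cumulative sum of per-step change magnitudes (for example, distances between consecutive frame embeddings); the paper's actual definition may differ, and the function names and numbers below are illustrative only.

```python
from bisect import bisect_right

def progress_curve(step_changes):
    """Cumulative progress in [0, 1] from per-step semantic change magnitudes."""
    total = sum(step_changes)
    if total <= 0:
        raise ValueError("need a positive total semantic change")
    curve, running = [0.0], 0.0
    for change in step_changes:
        running += change
        curve.append(running / total)
    return curve

def linearize(curve, n_out):
    """Fractional source-frame positions at which progress is evenly spaced (n_out >= 2)."""
    positions = []
    for k in range(n_out):
        target = k / (n_out - 1)
        # Segment [i, i + 1] of the curve that contains the target progress.
        i = min(bisect_right(curve, target) - 1, len(curve) - 2)
        span = curve[i + 1] - curve[i]
        frac = 0.0 if span == 0 else (target - curve[i]) / span
        positions.append(i + frac)
    return positions

# Four frames: little changes at first, then most of the change happens
# between the last two frames (made-up magnitudes).
curve = progress_curve([1.0, 1.0, 6.0])
positions = linearize(curve, 5)
```

Here most of the re-timed positions fall between source frames 2 and 3, where the semantic change is concentrated; a real pipeline would then resample or interpolate frames at those positions.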

🔹 Publication Date: Published on Apr 24

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.22554
• PDF: https://arxiv.org/pdf/2604.22554
• Project Page: https://sagipolaczek.github.io/semantic-progress-function/
• Github: https://github.com/SagiPolaczek/semantic-progress-function

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#VideoAI #GenerativeAI #ComputerVision #SemanticAnalysis #AIResearch


✨SketchVLM: Vision language models can annotate images to explain thoughts and guide users

📝 Summary:
SketchVLM is a training-free framework that enables vision-language models to generate editable SVG overlays for visual explanations, improving reasoning accuracy and annotation quality across multipl...

🔹 Publication Date: Published on Apr 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.22875
• PDF: https://arxiv.org/pdf/2604.22875
• Project Page: https://sketchvlm.github.io/

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#SketchVLM #VisionLanguageModels #ComputerVision #AI #ImageAnnotation


✨Sapiens2

📝 Summary:
Sapiens2 is a high-resolution transformer model for human-centric vision. It achieves state-of-the-art performance by combining unified pretraining objectives, a large 1-billion image dataset, and architectural improvements, excelling in tasks like pose and segmentation.

🔹 Publication Date: Published on Apr 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.21681
• PDF: https://arxiv.org/pdf/2604.21681
• Github: https://github.com/facebookresearch/sapiens2

🔹 Models citing this paper:
• https://huggingface.co/facebook/sapiens2
• https://huggingface.co/facebook/sapiens2-seg-5b
• https://huggingface.co/facebook/sapiens2-seg-1b

🔹 Spaces citing this paper:
• https://huggingface.co/spaces/facebook/sapiens2-seg
• https://huggingface.co/spaces/facebook/sapiens2-pointmap
• https://huggingface.co/spaces/facebook/sapiens2-normal

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#Sapiens2 #ComputerVision #TransformerModels #HumanCentricAI #DeepLearning



✨Probing Visual Planning in Image Editing Models

📝 Summary:
This paper redefines visual planning as a single-step image transformation and evaluates it with abstract puzzles. The authors' EAR paradigm and AMAZE dataset reveal that current AI models, even after finetuning, cannot match human zero-shot efficiency, highlighting a gap in visual reasoning.

🔹 Publication Date: Published on Apr 23

🔹 Paper Links:
• arXiv Page: https://arxiv.org/abs/2604.22868
• PDF: https://arxiv.org/pdf/2604.22868
• Project Page: https://spatigen.github.io/amaze.io/
• Github: https://github.com/spatigen/amaze

==================================

For more data science resources:
✓ https://xn--r1a.website/DataScienceT

#VisualPlanning #ImageEditing #ComputerVision #AIResearch #MachineLearning


🔥 SAM 3: Segment Anything with Concepts

💡 The paper introduces Segment Anything Model 3, a unified model that detects, segments, and tracks objects in images and videos based on concept prompts. The model achieves state-of-the-art performance in promptable concept segmentation and tracking by leveraging a unified model architecture with decoupled recognition and localization. The concept prompts can be short noun phrases, image exemplars, or a combination of both, and the model returns segmentation masks and unique identities for all matching object instances.

To advance promptable concept segmentation, the authors built a scalable data engine that produces a high-quality dataset with 4 million unique concept labels, including hard negatives, across images and videos. The model consists of an image-level detector and a memory-based video tracker that share a single backbone. The recognition and localization are decoupled with a presence head, which boosts detection accuracy.
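
One way to picture the decoupling: a single presence score answers "is the concept in this image at all?", while each candidate only has to answer "is this instance a match, given that the concept is present?". The sketch below combines the two by multiplication. That combination rule, the function name, the threshold, and the numbers are assumptions for illustration, not details taken from the summary above.

```python
def score_candidates(presence_prob, match_probs, threshold=0.5):
    """Gate per-candidate match scores with one image-level presence score."""
    finals = [presence_prob * p for p in match_probs]
    kept = [i for i, s in enumerate(finals) if s >= threshold]
    return finals, kept

# Same three candidates, scored once when the concept is judged present
# and once when it is judged absent (made-up probabilities).
match_probs = [0.9, 0.8, 0.2]
_, kept_present = score_candidates(0.95, match_probs)
_, kept_absent = score_candidates(0.05, match_probs)
```

With a low presence score every candidate is suppressed at once, so the per-candidate scores do not have to encode whether the concept appears in the image.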

The results show that Segment Anything Model 3 doubles the accuracy of existing systems in both image and video promptable concept segmentation, and improves previous capabilities on visual segmentation tasks. The authors also open source Segment Anything Model 3 along with a new benchmark for promptable concept segmentation, called Segment Anything with Concepts.

The main contributions of the paper are the introduction of a unified model architecture that achieves state-of-the-art performance in promptable concept segmentation and tracking, the creation of a large-scale dataset with unique concept labels, and the development of a new benchmark for evaluating promptable concept segmentation models. Overall, the paper presents a significant advancement in the field of computer vision and object segmentation, enabling more accurate and efficient detection, segmentation, and tracking of objects in images and videos based on concept prompts.

📅 Published on Nov 20, 2025

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2511.16719
• PDF: https://arxiv.org/pdf/2511.16719
• Project Page: https://ai.meta.com/sam3/

🤖 Models citing this paper:
• https://huggingface.co/AllanVester/SAM3.1-CoreML-FP16
• https://huggingface.co/AllanVester/SAM3.1-CoreML
• https://huggingface.co/embedl/sam3

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/kith777/rag_agent

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ComputerVision #ObjectSegmentation #ConceptLearning #ImageTracking #PromptableSegmentation


🔥 PhysX-Omni: Unified Simulation-Ready Physical 3D Generation for Rigid, Deformable, and Articulated Objects

💡 The paper introduces PhysX-Omni, a unified framework for generating simulation-ready 3D assets with physical properties across multiple categories. The problem addressed is that existing 3D generation methods either neglect physical properties or are limited to a single asset category, such as rigid, deformable, or articulated objects. To address this, the authors develop a novel geometry representation tailored for vision-language models, which directly encodes high-resolution 3D structures without compression, significantly improving generation performance.

The PhysX-Omni framework generates simulation-ready physical 3D assets using this novel geometry representation. The authors also construct the first general simulation-ready 3D dataset, PhysXVerse, covering diverse indoor and outdoor categories. To evaluate the framework, they propose PhysX-Bench, a benchmark that encompasses six key attributes: geometry, absolute scale, material, affordance, kinematics, and function description.

The results show that PhysX-Omni performs strongly in both generation and understanding, as measured by conventional metrics and by PhysX-Bench. Additional studies validate the potential of PhysX-Omni for applications in simulation-ready scene generation and robotic policy learning. The authors believe that PhysX-Omni can significantly advance a wide range of downstream applications, particularly in embodied AI and physics-based simulation.

The key contributions of the paper are the development of a novel geometry representation, the construction of the PhysXVerse dataset, and the proposal of the PhysX-Bench benchmark. These contributions enable the generation of simulation-ready physical 3D assets across multiple categories, which can be used in various applications such as robotics, computer vision, and simulation. Overall, the paper presents a significant advancement in the field of 3D generation and simulation, with potential applications in a wide range of areas.

📅 Published on May 20

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.21572
• PDF: https://arxiv.org/pdf/2605.21572
• Project Page: https://physx-omni.github.io

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ComputerVision #3DModeling #PhysicsBasedSimulation #ArticulatedObjectSimulation #DeformableObjectModeling


🔥 GenEvolve: Self-Evolving Image Generation Agents via Tool-Orchestrated Visual Experience Distillation

💡 The paper proposes a self-evolving image generation framework called GenEvolve that improves generative capabilities through iterative learning and reference-based prompting. The problem addressed is that high-quality image generation often requires combining a model's internal generative ability with external resources, and existing methods have limitations in handling diverse and demanding requests.

The GenEvolve framework models each generation attempt as a tool-orchestrated trajectory, where the agent gathers evidence, selects references, invokes generation skills, and composes them into a prompt-reference program. Unlike existing methods that rely on image-level scalar rewards, GenEvolve compares multiple trajectories for the same request and abstracts best-worst differences into structured visual experience.

This visual experience is provided to a privileged teacher branch, which uses visual experience distillation to provide dense token-level supervision to a student branch. This helps the student internalize better search, knowledge activation, reference selection, and prompt construction. The authors also construct GenEvolve-Data and GenEvolve-Bench to evaluate the framework.

The results show that GenEvolve achieves substantial gains over strong baselines, achieving state-of-the-art performance among current image-generation frameworks. The experiments on public benchmarks and GenEvolve-Bench demonstrate the effectiveness of the proposed framework. Overall, the paper contributes a novel self-evolving image generation framework that can effectively handle diverse and demanding generation challenges.

📅 Published on May 20

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.21605
• PDF: https://arxiv.org/pdf/2605.21605
• Project Page: https://ephemeral182.github.io/GenEvolve/

🤖 Models citing this paper:
• https://huggingface.co/MeiGen-AI/GenEvolve

📊 Datasets citing this paper:
• https://huggingface.co/datasets/MeiGen-AI/GenEvolve-Data-Bench

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ComputerVision #ImageGeneration #GenerativeModels #SelfEvolvingSystems #DeepLearning


🔥 TriSplat: Simulation-Ready Feed-Forward 3D Scene Reconstruction

💡 The paper presents TriSplat, a feed-forward 3D reconstruction network that generates simulation-ready meshes from single images. The problem addressed is that existing methods for 3D reconstruction require expensive post-processing steps to extract a usable mesh for simulation or physics reasoning. Most existing methods use Gaussian primitives and do not directly expose surfaces, making it difficult to obtain a simulation-ready mesh.

The method proposed in the paper uses oriented triangle primitives to represent scenes and directly exports simulation-ready mesh scenes from a single forward pass. The network predicts local 3D point maps, triangle attributes, camera poses, and optional intrinsics from input images. The approach constructs geometry normals from the predicted point maps, refines them with an image-conditioned normal head, and converts them into stable local frames for triangle parameterization.
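
The geometric part of that pipeline can be sketched in a few lines: estimate a normal from neighbouring points in the point map, then build an orthonormal frame around it. This is a generic sketch under stated assumptions (forward differences, a helper-axis frame construction, hypothetical function names); it omits the image-conditioned normal head and is not the paper's implementation.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def normalize(v):
    length = math.sqrt(dot(v, v))
    if length == 0:
        raise ValueError("cannot normalize a zero vector")
    return tuple(x / length for x in v)

def normal_from_point_map(points, row, col):
    """Normal at (row, col) from forward differences of neighbouring 3D points."""
    p = points[row][col]
    du = sub(points[row][col + 1], p)  # step along the image x axis
    dv = sub(points[row + 1][col], p)  # step along the image y axis
    return normalize(cross(du, dv))

def local_frame(normal):
    """Orthonormal (tangent, bitangent, normal) frame around a unit normal."""
    # Pick a helper axis that cannot be parallel to the normal.
    helper = (1.0, 0.0, 0.0) if abs(normal[0]) < 0.9 else (0.0, 1.0, 0.0)
    tangent = normalize(cross(helper, normal))
    bitangent = cross(normal, tangent)
    return tangent, bitangent, normal

# Toy point map: a flat 3x3 patch lying in the z = 0 plane.
points = [[(float(c), float(r), 0.0) for c in range(3)] for r in range(3)]
normal = normal_from_point_map(points, 1, 1)
tangent, bitangent, _ = local_frame(normal)
```

The sign of the normal depends on the axis convention of the point map, so a real system would also need a rule for orienting normals consistently, for example toward the camera.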

The results show that the proposed representation produces more geometry-faithful reconstructions than Gaussian feed-forward baselines while maintaining competitive novel-view rendering quality. The output of the network can be directly ingested by physics engines, collision detectors, and standard rendering pipelines without any conversion, making it a practical simulation-ready solution for feed-forward 3D scene reconstruction. The experiments were conducted on RealEstate10K and DL3DV datasets and demonstrate the effectiveness of the proposed approach. Overall, the paper contributes a novel method for 3D scene reconstruction that bypasses expensive post-processing steps and directly generates simulation-ready meshes from single images.

📅 Published on May 25

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2605.26115
• PDF: https://arxiv.org/pdf/2605.26115
• Project Page: https://lhmd.top/trisplat/#interactive

🤖 Models citing this paper:
• https://huggingface.co/lhmd/TriSplat

📊 Datasets citing this paper:
• https://huggingface.co/datasets/lhmd/re10k_torch

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#3DSceneReconstruction #SimulationReadyMeshes #FeedForwardNetworks #TrianglePrimitives #ComputerVision


🔥 A Very Big Video Reasoning Suite

💡 The paper introduces a large-scale video reasoning dataset and benchmark to study video intelligence capabilities beyond visual quality. The problem addressed is that current video models have focused on visual quality and their reasoning capabilities have been underexplored. Video reasoning involves understanding spatiotemporal structure such as continuity, interaction, and causality, which is essential for intelligent systems. However, the lack of large-scale training data has hindered systematic study of video reasoning.

To address this gap, the authors introduce the Very Big Video Reasoning Dataset, an unprecedentedly large-scale resource consisting of 200 curated reasoning tasks and over one million video clips. This dataset is approximately three orders of magnitude larger than existing datasets. The authors also present VBVR-Bench, a verifiable evaluation framework that incorporates rule-based, human-aligned scorers to enable reproducible and interpretable diagnosis of video reasoning capabilities.

The results of the study show early signs of emergent generalization to unseen reasoning tasks, indicating that the proposed dataset and benchmark can be used to develop more generalizable video reasoning models. The dataset, benchmark toolkit, and models are publicly available, laying a foundation for the next stage of research in generalizable video reasoning. The contributions of the paper are the introduction of a large-scale video reasoning dataset and benchmark, and the demonstration of their effectiveness in studying video reasoning capabilities and enabling the development of more generalizable models.

📅 Published on Feb 23

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2602.20159
• PDF: https://arxiv.org/pdf/2602.20159
• Project Page: https://video-reason.com/

🤖 Models citing this paper:
• https://huggingface.co/Video-Reason/VBVR-Wan2.2
• https://huggingface.co/Video-Reason/VBVR-LTX2.3-diffsynth
• https://huggingface.co/Video-Reason/VBVR-Wan2.1-diffsynth

📊 Datasets citing this paper:
• https://huggingface.co/datasets/Video-Reason/VBVR-Dataset
• https://huggingface.co/datasets/Video-Reason/VBVR-Bench-Data
• https://huggingface.co/datasets/Video-Reason/video-mcp

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/Video-Reason/VBVR-Bench-Leaderboard

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#VideoIntelligence #VideoReasoning #SpatiotemporalAnalysis #CausalityInAI #ComputerVision


🔥 SCAIL-2: Unifying Controlled Character Animation with End-to-end In-Context Conditioning

💡 The paper presents SCAIL-2, a framework for controlled character animation that enables end-to-end motion transfer from driving videos to reference characters without using intermediate representations. Prior methods relied on intermediate representations such as pose skeletons or masked backgrounds, which led to information loss. SCAIL-2 addresses this issue by directly concatenating driving videos to the sequence, allowing the model to obtain all required visual information from the input video.

To overcome the lack of end-to-end data, the authors unify sub-tasks of character animation with decoupled conditions and create a pipeline to synthesize a large dataset called MotionPair-60K, which contains heterogeneous tasks of character animation. The framework utilizes in-context mask conditioning and mode-specific RoPE as soft guidance beyond textual instructions and raw visual information.

The authors also propose Bias-Aware DPO to mitigate errors caused by synthetic discrepancies in detailed regions. This approach constructs preference items to address the issue. Extensive experiments demonstrate that SCAIL-2 substantially outperforms existing state-of-the-art approaches in various character animation tasks.
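
For context, the standard DPO objective, which Bias-Aware DPO presumably builds on given its name, fits in a few lines. This is the plain DPO loss for one preference pair, not the bias-aware variant, whose details the summary does not give; the function name, the `beta` value, and the log-probabilities are made up for illustration.

```python
import math

def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one pair, given sequence log-probabilities."""
    margin = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # -log(sigmoid(margin)), written to stay stable for large |margin|.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: no preference signal yet.
neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has raised the chosen sample and lowered the rejected one.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The loss falls as the policy prefers the chosen sample more strongly than the reference model does, which is the behaviour any preference pair construction ultimately relies on.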

The key contributions of the paper are the development of an end-to-end character animation framework that bypasses intermediate representations, the creation of a large synthetic dataset for motion transfer, and the proposal of a novel method to address synthetic discrepancies. The results show that SCAIL-2 achieves superior performance compared to existing methods, and the authors plan to release a large subset of synthetic data and model weights to facilitate further research.

📅 Published on Jun 9

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2606.10804
• PDF: https://arxiv.org/pdf/2606.10804
• Project Page: https://teal024.github.io/SCAIL-2/

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#CharacterAnimation #MotionTransfer #EndToEndLearning #InContextConditioning #ComputerVision


🔥 Vision as Unified Multimodal Generation

💡 The paper introduces a unified multimodal model that formulates computer vision tasks as generation problems using natural language and visual prompts. This approach allows for a single model to perform a wide range of vision tasks without requiring task-specific architectures. The model, called SenseNova-Vision, uses natural-language instructions and optional visual prompts to specify tasks and generates responses as text, images, or mixed text-and-image outputs.

To support large-scale training, the authors created the SenseNova-Vision Corpus, a computer-vision instruction-response corpus that spans text, image, and mixed targets. The model is trained on this corpus, along with auxiliary multimodal data, and achieves performance comparable to specialized systems across diverse vision tasks, including detection, OCR, keypoint estimation, segmentation, and camera pose estimation.

The results demonstrate that a single unified model can match leading task-specialized systems, suggesting that unified multimodal generation is a scalable route for integrating computer vision capabilities into general-purpose foundation models. The model and corpus are publicly available, providing a valuable resource for the research community. Overall, the paper presents a significant contribution to the field of computer vision, offering a unified and flexible approach to tackling a wide range of vision tasks.

📅 Published on Jul 7

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2607.06560
• PDF: https://arxiv.org/pdf/2607.06560

🤖 Models citing this paper:
• https://huggingface.co/sensenova/SenseNova-Vision-7B-MoT

📊 Datasets citing this paper:
• https://huggingface.co/datasets/sensenova/SenseNova-Vision-Corpus-50M
• https://huggingface.co/datasets/sensenova/SenseNova-Vision-Benchmark

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#MultimodalGeneration #VisionTasks #NaturalLanguageProcessing #ComputerVision #MultimodalLearning


🔥 ReDesign: Recovering Editable Design Structures from Images via Agentic Decomposition

💡 The paper presents ReDesign, a novel approach to recovering editable design structures from images, a task that is a common and costly bottleneck in modern design workflows. The problem is challenging because it requires recovering multiple attributes such as typography, vector geometry, colors, grouping, and layer ordering. ReDesign uses an agentic framework that grows an editable layer hierarchy by selecting and composing specialized tools across modalities. To ensure reliability despite imperfect tool outputs, the framework introduces a verification mechanism at each expansion step, providing local accept, prune, or retry feedback that prevents error accumulation and avoids large-scale reruns.
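
The accept, prune, or retry feedback at each expansion step can be pictured as a small control loop. The sketch below is a toy under stated assumptions: the function names, the dictionary layout, the two stand-in tools, and the verifier (which peeks at a ground-truth field for simplicity) are all hypothetical and not taken from the paper.

```python
def expand_step(region, tools, verify):
    """Run one expansion step; return an accepted layer, or None if the branch is dropped."""
    for tool in tools:
        candidate = tool(region)
        verdict = verify(region, candidate)
        if verdict == "accept":
            return candidate
        if verdict == "prune":
            return None
        # "retry": only this step is redone, using the next tool.
    return None  # every tool was tried without an accepted result

# Toy stand-ins: regions are dicts and each tool returns a layer dict.
def weak_ocr(region):
    return {"type": "text", "content": ""}

def strong_ocr(region):
    return {"type": "text", "content": region["pixels_say"]}

def verify(region, layer):
    # A real verifier would compare a re-rendered layer with the image;
    # this toy version reads the expected text directly.
    if region.get("is_noise"):
        return "prune"
    return "accept" if layer["content"] == region["pixels_say"] else "retry"

tools = [weak_ocr, strong_ocr]
title = expand_step({"pixels_say": "Summer Sale"}, tools, verify)
speck = expand_step({"pixels_say": "", "is_noise": True}, tools, verify)
```

Because the verdict is local to one step, a failed tool call costs one retry on that region instead of a rerun of the whole decomposition.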

The authors evaluate the method's editability at scale using the Figma Edit Replay Benchmark, consisting of 909 raw Figma files and 14,796 controlled edit instructions that replay edits on reconstructed outputs. The results show that ReDesign achieves strong visual fidelity while delivering the highest editability across layout, color, and text edits, outperforming layered decomposition baselines and serial tool-use pipelines. The paper's contributions include the introduction of the ReDesign framework, the Figma Edit Replay Benchmark, and the demonstration of the method's effectiveness in recovering editable design structures from images. Overall, the paper presents a significant advancement in the field of design recovery and editing, with potential applications in various design workflows.

📅 Published on Jul 28

🔗 Links:
• GitHub: https://github.com/huggingface
• arXiv: https://arxiv.org/abs/2607.25565
• PDF: https://arxiv.org/pdf/2607.25565
• Project Page: https://jintae-00.github.io/ReDesign/

📊 Datasets citing this paper:
• https://huggingface.co/datasets/Jintae-Park/ReDesign-Figma909

━━━━━━━━━━━━━━━━━━━━━━━━
📢 By: https://xn--r1a.website/PaperNexus

#ComputerVision #GraphicDesignAutomation #ImageProcessing #VectorGraphicsRecovery #DesignStructureExtraction
