WildRGB-D: Objects in the Wild
#NVIDIA unveils a novel RGB-D object dataset captured in the wild: ~8,500 recorded objects, ~20,000 RGB-D videos, 46 categories with corresponding masks and 3D point clouds.
Review https://t.ly/WCqVz
Data github.com/wildrgbd/wildrgbd
Paper arxiv.org/pdf/2401.12592.pdf
Project wildrgbd.github.io/
π#NVIDIA unveils a novel RGB-D object dataset captured in the wild: ~8500 recorded objects, ~20,000 RGBD videos, 46 categories with corresponding masks and 3D point clouds.
πReview https://t.ly/WCqVz
πData github.com/wildrgbd/wildrgbd
πPaper arxiv.org/pdf/2401.12592.pdf
πProject wildrgbd.github.io/
π9β€3π₯2π1π€©1π1
This media is not supported in your browser
VIEW IN TELEGRAM

Up to 69x Faster SAM
EfficientViT-SAM is a new family of accelerated Segment Anything Models. It keeps SAM's lightweight prompt encoder and mask decoder while replacing the heavy image encoder with EfficientViT (conceptual sketch below). Up to 69x faster; source code released. Authors: Tsinghua, MIT & #Nvidia.
Review https://t.ly/zGiE9
Paper arxiv.org/pdf/2402.05008.pdf
Code github.com/mit-han-lab/efficientvit
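
A minimal sketch of the encoder-swap idea, with illustrative module names and shapes (this is not the EfficientViT-SAM code): the prompt encoder and mask decoder stay SAM-style, and only the heavy image encoder is replaced with a lighter backbone.

```python
# Sketch only: module names/shapes are illustrative, not the EfficientViT-SAM API.
import torch
import torch.nn as nn

class LightImageEncoder(nn.Module):          # stand-in for the EfficientViT backbone
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(64, dim, 3, stride=2, padding=1),
        )
    def forward(self, x):                    # (B, 3, H, W) -> (B, C, H/4, W/4)
        return self.net(x)

class PromptEncoder(nn.Module):              # SAM-style point-prompt embedding
    def __init__(self, dim=256):
        super().__init__()
        self.point_embed = nn.Linear(2, dim)
    def forward(self, points):               # (B, N, 2) normalized point coords
        return self.point_embed(points)

class MaskDecoder(nn.Module):                # fuses image and prompt features
    def __init__(self, dim=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.head = nn.Linear(dim, 1)
    def forward(self, img_feats, prompt_feats):
        B, C, H, W = img_feats.shape
        tokens = img_feats.flatten(2).transpose(1, 2)          # (B, H*W, C)
        fused, _ = self.attn(tokens, prompt_feats, prompt_feats)
        return self.head(fused).transpose(1, 2).reshape(B, 1, H, W)

class FastSAMLike(nn.Module):                # same composition as SAM, new encoder
    def __init__(self):
        super().__init__()
        self.image_encoder = LightImageEncoder()   # the only swapped component
        self.prompt_encoder = PromptEncoder()
        self.mask_decoder = MaskDecoder()
    def forward(self, image, points):
        return self.mask_decoder(self.image_encoder(image),
                                 self.prompt_encoder(points))

model = FastSAMLike()
mask_logits = model(torch.randn(1, 3, 256, 256), torch.rand(1, 5, 2))
print(mask_logits.shape)                     # torch.Size([1, 1, 64, 64])
```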

BodyMAP: human body & pressure
#Nvidia (+CMU) unveils BodyMAP, the new SOTA in predicting body mesh (3D pose & shape) and 3D applied pressure on the human body. Source code released, dataset coming.
Review https://t.ly/8926S
Project bodymap3d.github.io/
Paper https://lnkd.in/gCxH4ev3
Code https://lnkd.in/gaifdy3q

Gradient Boosting Reinforcement Learning
#Nvidia unveils GBRL, a framework that brings Gradient Boosting Trees to the RL domain, adapting them to its unique challenges, including non-stationarity and the absence of predefined targets (conceptual sketch below). Code released.
Review https://t.ly/zv9pl
Paper https://arxiv.org/pdf/2407.08250
Code https://github.com/NVlabs/gbrl
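
A toy sketch of the underlying idea, gradient-boosted value estimation against bootstrapped TD targets; class and variable names are invented for illustration and this is not the GBRL library API.

```python
# Toy sketch, not the GBRL API: a value function as an additive ensemble of small
# regression trees, each new tree fit to the current TD residual (functional
# gradient boosting against a target that shifts as the values change).
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class BoostedValueFunction:
    def __init__(self, lr=0.1, max_depth=3):
        self.lr, self.max_depth, self.trees = lr, max_depth, []

    def predict(self, states):
        value = np.zeros(len(states))
        for tree in self.trees:                      # sum of shrunken tree outputs
            value += self.lr * tree.predict(states)
        return value

    def boost(self, states, td_targets):
        residual = td_targets - self.predict(states) # fit a new tree to the residual
        tree = DecisionTreeRegressor(max_depth=self.max_depth)
        tree.fit(states, residual)
        self.trees.append(tree)

# toy usage with an invented transition batch and a 1-step bootstrapped target
rng = np.random.default_rng(0)
states = rng.normal(size=(256, 4))
next_states = states + 0.1 * rng.normal(size=(256, 4))
rewards = states[:, 0]                               # invented reward signal
vf = BoostedValueFunction()
for _ in range(20):
    targets = rewards + 0.99 * vf.predict(next_states)
    vf.boost(states, targets)
print(vf.predict(states[:3]))
```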

EVER: Ellipsoid Rendering
UCSD & Google present EVER, a novel method for real-time differentiable emission-only volume rendering. Unlike 3DGS, it does not suffer from popping artifacts and view-dependent density, achieving ~30 FPS at 720p on an #NVIDIA RTX 4090 (background sketch below).
Review https://t.ly/zAfGU
Paper arxiv.org/pdf/2410.01804
Project half-potato.gitlab.io/posts/ever/
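
For context, a minimal sketch of the standard emission-absorption integral that "emission-only volume rendering" refers to, evaluated numerically along one ray; EVER's exact ellipsoid-primitive formulation is not reproduced here.

```python
# Background sketch: the standard emission-absorption rendering integral along a
# single ray (not EVER's ellipsoid-primitive renderer).
import numpy as np

def render_ray(sigma, color, t):
    """Accumulate emitted color weighted by transmittance: C = sum_i T_i * alpha_i * c_i."""
    dt = np.diff(t, append=t[-1] + (t[-1] - t[-2]))                # segment lengths
    alpha = 1.0 - np.exp(-sigma * dt)                              # per-segment opacity
    trans = np.cumprod(np.concatenate(([1.0], 1.0 - alpha)))[:-1]  # T before each segment
    weights = trans * alpha
    return (weights[:, None] * color).sum(axis=0)                  # (3,) RGB

# toy example: a single reddish density "blob" centered along the ray
t = np.linspace(0.0, 1.0, 128)
sigma = 20.0 * np.exp(-((t - 0.5) ** 2) / 0.005)
color = np.tile([1.0, 0.2, 0.2], (len(t), 1))
print(render_ray(sigma, color, t))
```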

Robo-Emulation via Video Imitation
OKAMI (UT & #Nvidia) is a novel foundation method that generates a manipulation plan from a single RGB-D video and derives a policy for execution.
Review https://t.ly/_N29-
Paper arxiv.org/pdf/2410.11792
Project https://lnkd.in/d6bHF_-s

"Nuclear" AI vs. Hyper-Cheap Inference
What do you expect in 2025 after the #Nvidia announcements at CES 2025? Feel free to comment :)
Anonymous Poll
24% - Portable Training Workstation
34% - Nuclear energy for AI training
33% - Cheaper inference-only devices
9% - Cloud-intensive inference-only

Omni-RGPT: SOTA MLLM Understanding
#NVIDIA presents Omni-RGPT, an MLLM for region-level comprehension of both images & videos. New SOTA on image/video-based commonsense reasoning.
Review https://t.ly/KHnQ7
Paper arxiv.org/pdf/2501.08326
Project miranheo.github.io/omni-rgpt/
Repo TBA soon

#Nvidia Foundation ZS-Stereo
Nvidia unveils FoundationStereo, a foundation model for stereo depth estimation with strong zero-shot generalization, plus a large-scale (1M stereo pairs) synthetic training dataset featuring large diversity and high photorealism. Code, model & dataset to be released.
Review https://t.ly/rfBr5
Paper arxiv.org/pdf/2501.09898
Project nvlabs.github.io/FoundationStereo/
Repo github.com/NVlabs/FoundationStereo/tree/master

HAMSTER: Hierarchical VLA Manipulation
#Nvidia unveils HAMSTER, a novel hierarchical VLA architecture enabling robotic manipulation with semantic, visual & geometric generalization, trained on easy-to-collect, off-domain data. Source code announced.
Review https://t.ly/2yXaY
Paper https://arxiv.org/pdf/2502.05485
Project https://hamster-robot.github.io/
Repo TBA

Unified Low-Level 4D Vision
#Nvidia's L4P is a novel feedforward, general-purpose architecture that solves low-level 4D perception tasks in a unified framework. It combines a ViT-based backbone with lightweight per-task heads that do not require extensive training (sketch below). One backbone, many SOTAs. Code announced.
Review https://t.ly/04DGj
Paper arxiv.org/pdf/2502.13078
Project research.nvidia.com/labs/lpr/l4p/
Repo TBA
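
A minimal sketch of the shared-backbone-plus-lightweight-heads pattern described above; module names, sizes, and task heads are illustrative assumptions, not the L4P code.

```python
# Sketch of the pattern only; names, sizes, and task heads are invented.
import torch
import torch.nn as nn

class SharedViTBackbone(nn.Module):
    def __init__(self, dim=384, depth=4, patch=16):
        super().__init__()
        self.patchify = nn.Conv2d(3, dim, patch, stride=patch)
        layer = nn.TransformerEncoderLayer(dim, nhead=6, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, depth)
    def forward(self, frames):                       # (B, 3, H, W)
        tokens = self.patchify(frames).flatten(2).transpose(1, 2)
        return self.encoder(tokens)                  # (B, N, dim) shared features

class LightHead(nn.Module):                          # one tiny head per task
    def __init__(self, dim=384, out_ch=1):
        super().__init__()
        self.proj = nn.Linear(dim, out_ch)
    def forward(self, feats):
        return self.proj(feats)

backbone = SharedViTBackbone()
heads = nn.ModuleDict({
    "depth": LightHead(out_ch=1),                    # e.g. per-token depth
    "flow": LightHead(out_ch=2),                     # e.g. per-token 2D motion
})
feats = backbone(torch.randn(2, 3, 224, 224))        # heavy features computed once
outputs = {task: head(feats) for task, head in heads.items()}
print({task: out.shape for task, out in outputs.items()})
```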

Neural-Free Sparse Voxels Rasterization
#Nvidia unveils a novel, efficient radiance field rendering algorithm that incorporates a rasterization process on adaptive sparse voxels, without neural networks or 3D Gaussians. Code released (custom license).
Review https://t.ly/Nh_ic
Paper https://lnkd.in/g8k8Zs6R
Project https://lnkd.in/gR-bD4Wx
Repo https://lnkd.in/gNHX-w4t

3D MultiModal Memory
M3 is a novel framework by UCSD & #NVIDIA for rendering 3D scenes with RGB & foundation model embeddings: rich spatial & semantic understanding via a novel memory system designed to retain multimodal information across videos.
Review https://t.ly/OrXZO
Paper arxiv.org/pdf/2503.16413
Project https://lnkd.in/dXAZ97KH
Repo https://lnkd.in/dWvunCET

Scaling Vision to 4K
PS3 by #Nvidia (+UC Berkeley) scales up CLIP-style vision pre-training to 4K resolution at *near-constant* cost: it encodes a low-res global image and selectively processes only the informative high-res regions (conceptual sketch below). Impressive work. Code, weights & Hugging Face release announced.
Review https://t.ly/WN479
Paper https://lnkd.in/ddWq8UpX
Project https://lnkd.in/dMkTY8-k
Repo https://lnkd.in/d9YSB6yv
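
A conceptual sketch of the selective high-res processing idea, with an invented saliency proxy and crop size; this is not the PS3 implementation, just the pattern of encoding a low-res global view and cropping only the top-k informative high-res regions.

```python
# Conceptual sketch, not the PS3 code: pick the top-k "informative" high-res crops
# from a cheap low-res proxy, so the heavy encoder only sees those.
import torch
import torch.nn.functional as F

def select_topk_regions(image_hr, k=8, crop=256):
    """Return (low-res global view, k selected high-res crops)."""
    B, C, H, W = image_hr.shape
    lowres = F.interpolate(image_hr, size=(H // 8, W // 8),
                           mode="bilinear", align_corners=False)
    # crude saliency proxy: local gradient energy on the low-res image
    gx = lowres[..., :, 1:] - lowres[..., :, :-1]
    gy = lowres[..., 1:, :] - lowres[..., :-1, :]
    energy = gx.abs().mean(1)[..., :-1, :] + gy.abs().mean(1)[..., :, :-1]
    gh, gw = H // crop, W // crop                        # grid of candidate crops
    scores = F.adaptive_avg_pool2d(energy.unsqueeze(1), (gh, gw)).flatten(1)
    top = scores.topk(k, dim=1).indices                  # (B, k) crop indices
    crops = []
    for b in range(B):
        for idx in top[b].tolist():
            r, c = divmod(idx, gw)
            crops.append(image_hr[b:b + 1, :, r * crop:(r + 1) * crop,
                                              c * crop:(c + 1) * crop])
    return lowres, torch.cat(crops)

lowres, crops = select_topk_regions(torch.randn(1, 3, 2048, 2048))
print(lowres.shape, crops.shape)   # only these tensors reach the heavy encoder
```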

PartField: #3D Part Segmentation
#Nvidia unveils PartField, a feedforward approach for learning part-based 3D features that captures the general concept of parts and their hierarchy. Suitable for single-shape decomposition, co-segmentation, correspondence & more. Code & models released under the Nvidia license.
Review https://t.ly/fGb2O
Paper https://lnkd.in/dGeyKSzG
Code https://lnkd.in/dbe57XGH
Project https://lnkd.in/dhEgf7X2

#Nvidia Describe Anything
Nvidia unveils the Describe Anything Model (DAM), the new SOTA in generating detailed descriptions for user-specified regions in images/videos, marked by points, boxes, scribbles, or masks (sketch of the prompt format below). Repo under Apache, dataset available, and live demo on Hugging Face.
Review https://t.ly/la4JD
Paper https://lnkd.in/dZh82xtV
Project https://lnkd.in/dcv9V2ZF
Repo https://lnkd.in/dJB9Ehtb
Demo https://lnkd.in/dXDb2MWU
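
A minimal sketch of how heterogeneous region prompts (points, boxes, masks) can be normalized into one binary mask before captioning; the function and defaults are hypothetical, not the DAM API.

```python
# Hypothetical helper, not the DAM API: fold points / box / mask prompts into one
# binary region mask that a region-aware captioner could consume with the image.
import numpy as np

def region_prompt_to_mask(h, w, points=None, box=None, mask=None, radius=6):
    """Return an (h, w) uint8 mask from whichever prompt type is provided."""
    out = np.zeros((h, w), dtype=np.uint8)
    if mask is not None:                      # already a mask: binarize and return
        return (np.asarray(mask) > 0).astype(np.uint8)
    if box is not None:                       # box = (x0, y0, x1, y1) in pixels
        x0, y0, x1, y1 = box
        out[y0:y1, x0:x1] = 1
    if points is not None:                    # points = [(x, y), ...], small disks
        yy, xx = np.mgrid[0:h, 0:w]
        for x, y in points:
            out[(yy - y) ** 2 + (xx - x) ** 2 <= radius ** 2] = 1
    return out

m = region_prompt_to_mask(240, 320, box=(40, 60, 180, 200))
print(m.sum())                                # pixels inside the prompted region
```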

#Nvidia Dynamic Pose
Nvidia unveils DynPose-100K, the largest dataset of dynamic Internet videos annotated with camera poses. Dataset released under the Nvidia license.
Review https://t.ly/wrcb0
Paper https://lnkd.in/dycGjAyy
Project https://lnkd.in/dDZ2Ej_Q
Data https://lnkd.in/d8yUSB7m

GENMO: Generalist Human Motion
#Nvidia presents GENMO, a unified generalist model for human motion that bridges motion estimation and generation in a single framework, conditioning on videos, 2D keypoints, text, music, and 3D keyframes. No code at the moment.
Review https://t.ly/Q5T_Y
Paper https://lnkd.in/ds36BY49
Project https://lnkd.in/dAYHhuFU

Diffusive Hand from Signs
LIGM + #NVIDIA unveil a novel generative model of 3D hand motions from sign language data, capturing motion characteristics such as handshapes, locations, and finger, hand & arm movements. Code, models & data to be released.
Review https://t.ly/HonX_
Paper https://arxiv.org/pdf/2508.15902
Project https://imagine.enpc.fr/~leore.bensabath/HandMDM/
Data drive.google.com/drive/u/1/folders/1BLsu2hAqhAJ_gnGb9TNXW7MLiSuSEzEj
Repo TBA

3D Prompted Vision-LLM
#Nvidia unveils SR-3D, a novel 3D-aware vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space. Flexible region prompting lets users annotate regions with bounding boxes or segmentation masks on any frame, or directly in 3D, without exhaustive multi-frame labeling. Code & dataset announced.
Review https://t.ly/5Y2c5
Paper https://arxiv.org/pdf/2509.13317
Project https://www.anjiecheng.me/sr3d
Repo TBA

A few "leaks" for you from the #Nvidia presentation I'm attending right now in Milan. Impressive stuff.
PS: sorry for the shitty quality of the pics.