Gradient Dude
TL;DR for DL/CV/ML/AI papers from an author of publications at top-tier AI conferences (CVPR, NIPS, ICCV, ECCV).

Most ML feeds go for fluff, we go for the real meat.

YouTube: youtube.com/c/gradientdude
IG: instagram.com/gradientdude
Transferring Dense Pose to Proximal Animal Classes
Artsiom Sanakoyeu, Vasil Khalidov, Maureen S. McCarthy, Andrea Vedaldi, Natalia Neverova (Facebook AI Research)
In CVPR 2020.

๐ŸŒhttps://asanakoy.github.io/densepose-evolution/
โ–ถ๏ธyoutu.be/OU3Ayg_l4QM
๐Ÿ“https://arxiv.org/pdf/2003.00080.pdf


โ“ What?
DensePose approach predicts the pose of humans densely and accurately given a large dataset of poses annotated in detail.
We want to extend the same approach to animals but without annotations. Because it's super expensive to collect DensePose annotations for all different classes of animals. So we show that, at least for proximal animal classes such as chimpanzees, it is possible to transfer the knowledge existing in DensePose for humans. We propose to utilize the existing annotations of humans and do self-training on unlabeled images of animals.

In a nutshell, we first pretrain DensePose on the existing human annotations. Then we predict DensePose on unlabeled images, select the most confident predictions, and throw them into the augmented training set for retraining the model. To select the most confident DensePose predictions point-wise, we introduce a novel Auto-Calibrated version of DensePose-RCNN which can estimate the uncertainty of its predictions for every pixel.
We tested several techniques for sampling pseudo-labels and concluded that sampling based on confidence estimates from fine-grained tasks (24-way body-part estimation and DensePose UV-maps) results in the best performance.
We introduced a novel DensePose-Chimps dataset with DensePose ground-truth annotations for chimps and tested our models on it, obtaining a significant performance improvement over the baseline.
In this paper, we conducted thorough experiments only for chimps, but the method can be extended to other animals like cats and dogs as well.
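
A minimal sketch of the "select the most confident predictions" step described above. It assumes each predicted foreground point carries a scalar predicted variance (low variance = high confidence); the `Point` record layout, the function name `select_pseudo_labels`, and the `keep_fraction` parameter are hypothetical, introduced only for illustration, and are not taken from the paper's code.

```python
from typing import List, Tuple

# Hypothetical record for one predicted foreground point:
# (frame_id, pixel_x, pixel_y, predicted_variance)
Point = Tuple[int, int, int, float]


def select_pseudo_labels(points: List[Point], keep_fraction: float) -> List[Point]:
    """Keep the fraction of points with the lowest predicted variance."""
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    if not points:
        return []
    # Sort by variance: the most confident points come first.
    ranked = sorted(points, key=lambda p: p[3])
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


# Toy predictions on two frames (values are made up for the example).
predictions = [
    (0, 10, 12, 0.02),
    (0, 11, 12, 0.50),
    (1, 40, 7, 0.10),
    (1, 41, 7, 0.90),
]
selected = select_pseudo_labels(predictions, keep_fraction=0.5)
print(selected)
```

The selected points would then be added to the training set as pseudo-labels for the next training round. The actual sampling strategies compared in the paper may differ from this simple top-fraction rule.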

โœ๏ธ More details:
1. To transfer DensePose from humans to animals we need a reference 3D model of an animal. Let's suppose we got an artist-created 3D model of the desired animal. The next step is to establish a dense mapping between the 3D model of animal and 3D model of a human. This is necessary to unify the evaluation protocols between humans and animals and allows to transfer of knowledge and annotations between different species. The matching between 3D models is done by matching semantic descriptors of the vertices on the meshes.

2. Our goal is to develop a DensePose predictor for a new class. Such a predictor must detect the object via a bounding box, segment it from the background, and obtain the DensePose chart and UV-map coordinates for each foreground pixel. To do this we introduce a multi-head R-CNN architecture that combines multiple recognition tasks within a single model.
The first head refines the coordinates of the bounding box. The second head computes a foreground-background segmentation mask in the same way as Mask R-CNN. The third and final head computes a part segmentation mask I, assigning each pixel to one of the 24 body-part charts, and the UV-map values for each foreground pixel.

3. We have a few existing instance-segmentation and detection annotations for some animals in the COCO dataset. Let's use them! Given a target animal class, say chimps, we want to find an optimal support domain: the COCO classes for which pretraining gives the best detection (or segmentation) performance on a held-out set of chimps.

4. We jointly train DensePose prediction for people together with detection and segmentation for the other classes in the support domain. The goal is always to build a model only for the final target class, and we found that merging classes is an effective way of integrating information. So all support-domain categories are merged into one, and training is done in a class-agnostic manner.

5. Now we have our baseline network which knows a lot about humans and a bit about the detection and segmentation of animals. We run this model over ~5 TB of videos from camera traps in the wild and select around 100k video frames with good detections. Now we aim to utilize DensePose pseudo-labels obtained on the unlabeled frames for retraining the network.
6. To select good point-wise predictions, our model has to know how to estimate its uncertainty for every pixel and for every task we are solving. We introduce a novel Auto-Calibrated version of DensePose-RCNN which can estimate the uncertainty of its predictions for every pixel and every task. We propose to model: (a) classification uncertainty (for object classification and segmentation) using temperature scaling in the softmax layer; and (b) regression uncertainty (for bounding box proposals and DensePose UV-maps) by predicting a Gaussian distribution instead of a single target value. The higher the predicted variance, the higher the uncertainty.
7. Now, given pixel-wise uncertainties, we can sample for the second round of training only those foreground points from the selected 100k frames which have the highest confidence. We experimented with different sampling strategies and show that sampling based on the confidences from fine-grained tasks (24-way body part segmentation, UV-maps) results in the best performance.
8. The network retrained on the augmented data (existing human annotations + pseudo-labeled animals) shows a significant performance boost on the held-out, manually annotated DensePose-Chimps dataset.
The video demonstration of the self-trained model: youtu.be/OU3Ayg_l4QM
9. We also show that the proposed Auto-Calibrated RCNN improves over the baseline even without self-training on standard DensePose-Human, detection and segmentation tasks. This is due to the higher robustness of the proposed model to unseen data distributions at test time.
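
The two uncertainty models from step 6 can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: it assumes the regression head outputs a mean and a log-variance per value, and that the temperature is a single positive scalar. The function names `gaussian_nll` and `softmax_with_temperature` are hypothetical.

```python
import math
from typing import List


def gaussian_nll(target: float, mean: float, log_var: float) -> float:
    """Negative log-likelihood of `target` under N(mean, exp(log_var)).

    The constant 0.5 * log(2 * pi) is dropped. A large predicted variance
    down-weights the squared error but is penalized by the log_var term.
    """
    return 0.5 * (log_var + (target - mean) ** 2 / math.exp(log_var))


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Softmax over logits / temperature; a higher temperature gives flatter probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Regression: same error, but different predicted variances.
confident = gaussian_nll(target=2.0, mean=0.0, log_var=0.0)
uncertain = gaussian_nll(target=2.0, mean=0.0, log_var=2.0)

# Classification: the same logits become less peaked at a higher temperature.
sharp = softmax_with_temperature([2.0, 0.0], temperature=1.0)
flat = softmax_with_temperature([2.0, 0.0], temperature=4.0)
```

With a loss of this form, the predicted variance can be read off at test time as a per-pixel uncertainty, which is what the sampling in step 7 relies on.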

โœ”๏ธConclusion:
- Studied the problem of extending dense body pose recognition to animal species and suggested that doing this at scale requires learning from unlabelled data;
- demonstrated that existing detection, segmentation, and dense pose labeling models can transfer very well to a proximal animal class such as chimpanzee despite significant inter-class differences;
- introduced Auto-Calibrated DensePose-RCNN which can estimate the uncertainty of its predictions;
- introduced the novel DensePose-Chimps dataset for benchmarking dense pose prediction for chimpanzees;
- showed that substantial improvements can be obtained by carefully selecting which categories to use to pre-train the model, by using a class-agnostic architecture to integrate different sources of information, and by modeling labeling uncertainty to grade pseudo-labels for self-training;
- achieved excellent performance without using a single labeled image of the target class for training.
Channel name was changed to «Gradient Dude»
Guys, I did some rebranding and also opened a channel on Instagram where I will post more high-level information about new papers and research.
Subscribe: https://www.instagram.com/gradientdude/
Jukebox: A Generative Model for Music 🎶
🌐 https://openai.com/blog/jukebox
💻 Google Colab: https://colab.research.google.com/github/openai/jukebox/blob/master/jukebox/Interacting_with_Jukebox.ipynb

OpenAI created a neural network that can generate music. Amazing breakthrough!
It models music directly as raw audio and produces voice as well.

โ“ Challenges
- Existing symbolic generators have limitationsโ€”they cannot capture human voices or many of the more subtle timbres, dynamics, and expressivity that are essential to music.
- Sequences are very long. We have to deal wth extremely long-range dependencies
- A typical 4-minute song at CD quality (44 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game
- Previous work on MuseNet synthesized music based on large amounts of MIDI data.
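
The "over 10 million timesteps" figure can be checked with a few lines of arithmetic. This sketch assumes the standard CD sample rate of 44,100 Hz (rounded to 44 kHz in the text above) and a single audio channel.

```python
SAMPLE_RATE_HZ = 44_100  # standard CD sample rate, assumed here
SONG_SECONDS = 4 * 60    # a typical 4-minute song

# One timestep per audio sample.
timesteps = SAMPLE_RATE_HZ * SONG_SECONDS
print(timesteps)  # 10584000
```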

โœ๏ธ Method

- Based on Vector Quantised-Variational AutoEncoders [VQ-VAE] (NeurIPS 2017) and VQ-VAE-2 (NeurIPS 2019)
https://papers.nips.cc/paper/7210-neural-discrete-representation-learning.pdf
https://arxiv.org/pdf/1906.00446.pdf
- Hierarchical VQ-VAEs (NIPS 2018)
https://arxiv.org/abs/1806.10474

- Three levels in the VQ-VAE compress the 44 kHz raw audio by 8x, 32x, and 128x, respectively, with a codebook size of 2048 for each level.
- Codes are generated with transformers: Sparse Transformers serve as the learned priors over the VQ-VAE codes.
- 3 levels of priors: a top-level prior that generates the most compressed codes, and two upsampling priors that generate less compressed codes conditioned on the level above.
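
To make the compression levels concrete, here is a toy sketch of the vector-quantization step and of how many codes each level yields. It is a generic nearest-codebook lookup under simplifying assumptions (a 3-entry, 2-dimensional codebook instead of 2048 entries, and a 1024-sample clip), not Jukebox's actual code; `quantize` is a hypothetical helper name.

```python
from typing import Dict, List, Sequence


def quantize(latents: Sequence[Sequence[float]],
             codebook: Sequence[Sequence[float]]) -> List[int]:
    """Map each latent vector to the index of its nearest codebook vector (squared L2)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return [
        min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
        for z in latents
    ]


# Toy codebook with 3 codes; each Jukebox level uses 2048 codes.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]]
latents = [[0.1, -0.2], [0.9, 1.2], [-0.8, 0.7]]
codes = quantize(latents, codebook)
print(codes)  # [0, 1, 2]

# Number of discrete codes per level for a clip of 1024 raw samples.
n_samples = 1024
tokens_per_level: Dict[int, int] = {hop: n_samples // hop for hop in (8, 32, 128)}
print(tokens_per_level)  # {8: 128, 32: 32, 128: 8}
```

The top level (128x) produces the shortest code sequence, which is why the top-level prior can model the longest musical context.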

Learning Music Priors and Upsamplers:
- [Sparse Transformers](https://openai.com/blog/sparse-transformer/) as the learned priors for VQ-VAEs.

Conditional music generation:
- The top-level transformer is trained on the task of predicting compressed audio tokens conditioned on artist and genre.
- Lyrics conditioning using an extra encoder to produce a representation for the lyrics and attention layers that use queries from the music decoder to attend to keys and values from the lyrics encoder.
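
The lyrics conditioning described above is a form of cross-attention. Below is a minimal single-head sketch in plain Python, assuming plain scaled dot-product attention without the learned projection matrices a real transformer layer would have; the function name `cross_attention` and the toy vectors are made up for illustration.

```python
import math
from typing import List


def cross_attention(queries: List[List[float]],
                    keys: List[List[float]],
                    values: List[List[float]]) -> List[List[float]]:
    """Scaled dot-product attention: queries from the music decoder,
    keys and values from the lyrics encoder."""
    dim = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        shift = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - shift) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the value vectors.
        outputs.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return outputs


# One music-decoder query attending over two lyric tokens with identical keys:
# the attention weights are equal, so the output is the mean of the values.
out = cross_attention(queries=[[1.0, 0.0]],
                      keys=[[1.0, 0.0], [1.0, 0.0]],
                      values=[[0.0], [2.0]])
```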

๐Ÿ—ƒ๏ธ Dataset
- Colected a new dataset of 1.2 million songs (600,000 of which are in English), paired with the corresponding lyrics and metadata from LyricWiki.


โœ”๏ธ Results and Limitations

- There is still a significant gap between these generations and human-created music.
- Local musical coherence, traditional chord patterns, impressive solos.
- No choruses that repeat.
- The downsampling and upsampling process introduces discernible noise.
- Very slow to sample (because of the autoregressive structure): ~9 hours to render 1 minute of audio.
- Currently trained on English lyrics and mostly Western music.
- A set of 10 musicians from various genres were given early access to the Jukebox tool to give feedback on this work. While Jukebox is an interesting research result, these musicians did not find it immediately applicable to their creative process given some of its current limitations.

🔮 Future work
- Speed improvement (e.g., via model distillation)
- Reduce noise, improve quality.
- Conditioning on MIDI files.
3D Menagerie: Modeling the 3D Shape and Pose of Animals, Zuffi et al., CVPR 2017.
https://arxiv.org/abs/1611.07700

โ—The authors describe a method to create a realistic 3D model of animals and to fit this model to 2D images.

โœ๏ธ Main contribution:
- Global/Local Stitched Shape model (GLoSS) which aligns a template mesh to different shapes, providing a coarse registration between very different animals.
- Multi-Animal Linear model (SMAL) which provides a shape space of animals trained from 41 scans
- the model generalizes to new animals not seen in training
- one can fit SMAL to 2D data using detected keypoints and binary segmentations
- SMAL can generate realistic animal shapes in a variety of poses.

The authors showed that, starting from 3D scans of toys, we can learn a model that generalizes to images of real animals as well as to types of animals not seen during training.
The proposed parametric SMAL model is differentiable and can be fit to the data using gradient-based algorithms.
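
To illustrate what "fit using gradient-based algorithms" means, here is a heavily simplified sketch: a linear shape model (mean shape plus a weighted sum of basis shapes) fitted to target coordinates by plain gradient descent on a squared error. This is an assumption-laden toy, not SMAL itself: it ignores pose, skinning, camera projection, and the keypoint and silhouette terms of the real fitting objective, and all names (`shape`, `fit_betas`) and numbers are made up.

```python
from typing import List


def shape(mean: List[float], basis: List[List[float]], betas: List[float]) -> List[float]:
    """Linear shape model: mean + sum_k betas[k] * basis[k]."""
    return [
        m + sum(b * B[i] for b, B in zip(betas, basis))
        for i, m in enumerate(mean)
    ]


def fit_betas(mean: List[float], basis: List[List[float]], target: List[float],
              lr: float = 0.1, steps: int = 200) -> List[float]:
    """Minimize 0.5 * ||shape(betas) - target||^2 by gradient descent."""
    betas = [0.0] * len(basis)
    for _ in range(steps):
        residual = [p - t for p, t in zip(shape(mean, basis, betas), target)]
        # The gradient w.r.t. betas[k] is the dot product of basis[k] and the residual.
        grads = [sum(Bi * ri for Bi, ri in zip(B, residual)) for B in basis]
        betas = [b - lr * g for b, g in zip(betas, grads)]
    return betas


# Toy model: 4 coordinates, 2 shape components.
mean_shape = [0.0, 0.0, 0.0, 0.0]
shape_basis = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
observed = shape(mean_shape, shape_basis, [0.5, -0.3])  # synthetic target
fitted = fit_betas(mean_shape, shape_basis, observed)
print(fitted)
```

Because every step of the model is differentiable, the same recipe extends to richer objectives by adding more terms to the loss.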

๐ŸŒ My blog post describing the method in more details https://gdude.de/blog/2020-08-01/SMAL-CVPR2017
I'm really into neural art. Oh boy, how amazing van Gogh is in VR! You can get an immersive experience if you follow the link https://static.kuula.io/share/79QMS