Gradient Dude
TL;DR for DL/CV/ML/AI papers from an author of publications at top-tier AI conferences (CVPR, NIPS, ICCV, ECCV).

Most ML feeds go for fluff, we go for the real meat.

YouTube: youtube.com/c/gradientdude
IG: instagram.com/gradientdude
Researchers from Berkeley rolled out VideoGPT - a transformer that generates videos.

The results are not a "WOW" yet, but the architecture is quite simple and can now serve as a starting point for future work in this direction. As you know, GPT-3 for text generation was not built overnight either. So let's wait for faster methods and better quality.

📝 Paper
⚙️ Code
🌐 Project page
🏃 Demo
Infinite image generation and resampling 🔥

This method can generate infinite images of diverse and complex scenes that transition naturally from one into another. It does so without any conditioning and trains without any supervision from a dataset of unrelated square images.

You can check an interactive demo on the project website.

📝 Paper
Snap has released a new model for animating the entire human body (not just the face). Looks pretty good.

The principle is similar to their previous method, First Order Motion Model, which animates heads. The difference is that (a) the background motion is explicitly modeled here; and (b) instead of regressing local affine transformations for a set of keypoints, this method learns heatmaps of different body parts in an unsupervised way, and the transformation matrix of each body part is computed by applying principal component analysis (PCA) to the predicted heatmaps.
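
As a rough illustration of point (b), the sketch below treats one predicted heatmap as a 2D distribution. It takes the translation from the center of mass and a 2x2 rotation-and-scale matrix from the principal axes of the covariance. This is a minimal plain-Python sketch, not Snap's implementation; the function name `heatmap_to_affine` and the toy heatmap are assumptions made for illustration.

```python
import math

def heatmap_to_affine(heatmap):
    """Return (mean, A) for a 2D heatmap given as a list of rows.

    mean is the (x, y) centre of mass; A is a 2x2 matrix whose columns are the
    principal axes scaled by the standard deviation along each axis.
    """
    total = sum(sum(row) for row in heatmap)
    # Centre of mass (first moment) of the normalised heatmap.
    mx = sum(v * x for row in heatmap for x, v in enumerate(row)) / total
    my = sum(v * y for y, row in enumerate(heatmap) for v in row) / total
    # Covariance (second central moments).
    sxx = sxy = syy = 0.0
    for y, row in enumerate(heatmap):
        for x, v in enumerate(row):
            p = v / total
            sxx += p * (x - mx) ** 2
            sxy += p * (x - mx) * (y - my)
            syy += p * (y - my) ** 2
    # Closed-form eigen-decomposition of the symmetric 2x2 covariance.
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    lam1, lam2 = half_trace + root, max(half_trace - root, 0.0)
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # angle of the main axis
    c, s = math.cos(theta), math.sin(theta)
    a = [[c * math.sqrt(lam1), -s * math.sqrt(lam2)],
         [s * math.sqrt(lam1), c * math.sqrt(lam2)]]
    return (mx, my), a

# Toy heatmap: a horizontal bar, so the main axis should point along x.
bar = [[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0]]
mean, a = heatmap_to_affine(bar)
```

For the horizontal bar, the mean lands in the middle of the bar and the first column of `a` points along the x axis. A real model would do the same computation on batched tensors for every body part.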

More details on the project website. Most importantly, there is code and pretrained weights. So go ahead and animate!

P.S. 2 years ago another method for animating the whole body "Everybody Dance Now" was released, but there you had to retrain the network for each new person.
Moore's law is still working. Yesterday IBM announced that they created the first 2nm chip!

They claim that their 2nm technology will improve performance by 45% at the same power, or cut energy use by 75% at the same performance, compared to modern 7nm processors (e.g., Intel's).

IBM is one of the world's leading research centers for future semiconductor technology, but it sold its manufacturing business to GlobalFoundries in 2014. So currently IBM only develops IP in collaboration with others (Samsung and, as recently announced, Intel) for their manufacturing facilities.

The latest NVIDIA GPUs based on the Ampere microarchitecture (2020) use TSMC's 7nm fabrication process. TSMC's 3nm is set to enter production in 2022. But when is IBM/Intel's 2nm even coming? I'm also curious if Intel can even manage their 5nm chips by 2024/25.

Source article.
Another cool work from OpenAI: Diffusion Models Beat GANs on Image Synthesis.
New SOTA for image generation on ImageNet

The work builds on a relatively new type of generative model, the diffusion probabilistic model. A diffusion model is a parameterized Markov chain trained using variational inference to produce samples matching the data after a finite number of steps. The diffusion process is a Markov chain that gradually adds noise to the data, in the direction opposite to sampling, until the signal is destroyed. So here we learn the reverse transitions of this chain, which undo the diffusion process. And of course, everything is parameterized with neural networks.
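
To make the noising direction concrete, here is a minimal plain-Python sketch of the standard DDPM closed form, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps, where abar_t is the cumulative product of (1 - beta_t). The linear beta schedule, its endpoint values, and the helper names are assumptions for illustration; none of this is taken from the OpenAI code.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear noise schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x0, (1 - abar_t) * I)."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    return [signal * v + noise * rng.gauss(0.0, 1.0) for v in x0]

rng = random.Random(0)
alpha_bars = make_alpha_bars(1000)
x0 = [1.0, -1.0, 0.5]                          # a toy "image" with three pixels
x_early = q_sample(x0, 0, alpha_bars, rng)     # almost identical to x0
x_late = q_sample(x0, 999, alpha_bars, rng)    # almost pure Gaussian noise
```

The network is then trained to step backwards through this chain, from `x_late`-like noise towards `x0`-like data. That training loop is left out here.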

It produces very high-quality generations, even better than GANs (this is especially visible on the man with a fish, who looks far less convincing in the BigGAN samples). The current disadvantage of diffusion models is slow training and inference.

📝 Paper
⚙️ Code
Scheme of the Denoising Diffusion Probabilistic Model.

Sampling process goes from left to right, while Diffusion goes from right to left by gradually adding noise to the input.
Chinese researchers are very fond of doing extensive surveys of a particular sub-field of machine learning, listing the main works and the major breakthrough ideas. There are so many articles published every day, and it is impossible to read everything. Therefore, such reviews are valuable (if they are well written, of course, which is quite rare).

Recently there was a very good paper reviewing various variants of Transformers with a focus on language modeling (NLP). This is a must-read for anyone getting into the world of NLP and interested in Transformers. The paper discusses the basic principles of self-attention and such details of modern variants of Transformers as architecture modifications, pre-training, and various applications.

📝 Paper: A Survey of Transformers.
Facebook AI has built a system called TextStyleBrush that can replace text both in scenes and handwriting, in one shot, using only a single example word.
The model was made self-supervised because it is extremely hard to collect labeled pairs of text in different conditions and to annotate segmentation masks for text (although I think it can be done using synthetic generation).

The model is trained to understand unlimited text styles for not just different typography and calligraphy, but also for different transformations, like rotations, curved text, and deformations that happen between paper and pen when handwriting; background clutter; and image noise. The main idea is to disentangle the content of a text image from all aspects of the appearance of the entire word box. The representation of the overall appearance can then be applied as a one-shot-transfer without retraining on the novel source style samples.

The model consists of a style encoder, content encoder, and stylized text generator (plus a bunch of losses).
The generator architecture is based on the StyleGAN2 model. However, the design of StyleGAN2 has an important limitation: StyleGAN2 is an unconditional model, meaning it generates images by sampling a random latent vector. For generating photo-realistic text images, however, one needs to control the output based on two separate sources: the desired text content and style. This is solved by extracting layer-specific style information and injecting it at each layer of the generator (it is some sort of conditional instance normalization).
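
The parenthetical about conditional instance normalization can be illustrated with a small sketch: a feature channel is normalized to zero mean and unit variance, then rescaled and shifted by a (scale, bias) pair that would be derived from the style vector for that layer. This is a generic plain-Python illustration with assumed names and toy numbers. The actual model may inject style differently (StyleGAN2, for instance, modulates convolution weights instead).

```python
import math

def conditional_instance_norm(channel, scale, bias, eps=1e-5):
    """Normalise one feature channel, then apply a style-dependent scale and bias."""
    n = len(channel)
    mean = sum(channel) / n
    var = sum((v - mean) ** 2 for v in channel) / n
    return [scale * (v - mean) / math.sqrt(var + eps) + bias for v in channel]

# Toy example: one flattened feature channel rendered under two "styles".
# In a real generator, scale and bias would be predicted from the style vector.
features = [0.0, 2.0, 4.0, 6.0]
style_a = conditional_instance_norm(features, scale=1.0, bias=0.0)
style_b = conditional_instance_norm(features, scale=3.0, bias=10.0)
```

The content (the relative pattern inside `features`) is the same in both outputs; only the channel statistics change, which is what lets a single generator render the same text in different styles.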

The losses are the following: 1) reconstruction and cycle losses; 2) a discriminator (real/fake) loss; 3) a recognizer, a network that reads the text in the stylized image and makes sure that no content is lost; 4) a typeface classifier, a pretrained network that measures how well the generator captures the style of the input.

Results are quite striking!
Now imagine driving through the busy streets of Hong Kong and seeing street signs translated in real time and projected onto the windshield of your car. Or one day we will send personalized messages by generating creative images with the text embedded in them (instead of stickers).

🌀 Blogpost
📝 Paper
This is the architecture. The content encoder encodes the text, the style encoder extracts the style, and the generator produces stylized text conditioned on a style vector.
Just a small announcement 🔥
Our new (with Facebook AI Research) #CVPR21 paper is out!

Discovering Relationships between Object Categories via Universal Canonical Maps

TL;DR: DensePose for animals on steroids, which as a byproduct can automatically discover correspondences between 3D shapes of animals using novel cycle losses.

I will present the paper today (21.06) at 11am EDT / 5pm CET. Feel free to join the live Q&A session and ask me a question 😉.

🌐 Project page
▢️ Video explanation
πŸ“ Paper
πŸ›  Source code
(1) High-level scheme of our method and (2) some more results.
​​I'm happy to announce that our team (me, Stepan Konev, Kirill Brodt) was awardedπŸ… 3rd place within the Waymo Motion Prediction Challenge 2021.

To plan a safe and efficient route, an autonomous vehicle should anticipate future motions of other agents around it. Motion prediction is an extremely challenging task that recently gained significant attention from the research community. We present a simple and yet very strong baseline for multimodal motion prediction based purely on Convolutional Neural Networks.

The task is the following: Given agents' tracks for the past 1 second on a corresponding map, we had to predict the positions of the agents on the road for 8 seconds into the future.

Our model takes a raster image centered around a target agent as input and directly predicts a set of possible trajectories along with their confidences. The raster image is obtained by rasterizing the scene and the history of all the agents. While easy to implement, the proposed approach achieves competitive performance compared to the state-of-the-art methods on the Waymo Open Dataset Motion Prediction Challenge (2021): our model ranks 1st by minimum average displacement error (minADE) and 3rd by mAP score.
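
For readers unfamiliar with the metric, minimum average displacement error can be sketched as follows: for each predicted trajectory, average the Euclidean distance to the ground truth over all time steps, then keep the best candidate. This is a minimal plain-Python sketch. The function names and toy trajectories are illustrative and are not taken from the MotionCNN code or the official Waymo evaluation, which also has details not covered here.

```python
import math

def average_displacement_error(pred, gt):
    """Mean Euclidean distance between two trajectories of (x, y) points."""
    return sum(math.dist(p, g) for p, g in zip(pred, gt)) / len(gt)

def min_ade(predictions, gt):
    """Smallest ADE over a set of candidate trajectories (multimodal output)."""
    return min(average_displacement_error(pred, gt) for pred in predictions)

ground_truth = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
candidates = [
    [(0.0, 1.0), (1.0, 1.0), (2.0, 1.0)],   # parallel track, 1 unit off
    [(0.0, 0.0), (1.0, 0.0), (2.0, 1.5)],   # correct until the last step
]
best = min_ade(candidates, ground_truth)
```

Because only the best candidate counts, a model is rewarded for covering several plausible futures rather than averaging them into one trajectory.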

We wrote a small paper and released our code!

📜 Technical report
⚒ Code
Pipeline of our motion prediction approach (MotionCNN) and the results.