Gradient Dude
2.54K subscribers
180 photos
50 videos
2 files
169 links
TL;DR for DL/CV/ML/AI papers from an author of publications at top-tier AI conferences (CVPR, NIPS, ICCV, ECCV).

Most ML feeds go for fluff; we go for the real meat.

YouTube: youtube.com/c/gradientdude
IG instagram.com/gradientdude
It's Sunday, pancake time 👌🏻, so I could not resist sharing this spectacular deepfake with you.
Neural Funk: AI generates endless breakbeats

Enthusiasts from Skoltech trained a WaveGAN on 7,500 vintage drum loops, then used the resulting model to generate thousands of new ones.
I have attached my favorite 6-minute sample (147 bpm). Love it!

The result was obtained by slowly moving a point along a random trajectory in the model’s latent space. Each point in the latent space corresponds to a drum break, either an existing one or a new one. Moving linearly between two points produces a smooth transition between the two corresponding breaks.
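The trajectory idea is easy to sketch. Below is a minimal, hypothetical version: sample random anchor points in the latent space and linearly interpolate between consecutive anchors; feeding each point of the trajectory to a WaveGAN-style generator (not included here) would yield smoothly morphing drum loops. The dimensions and step counts are illustrative, not the authors' settings.

```python
import numpy as np

def random_latent_walk(dim=100, n_anchors=8, steps_between=64, seed=0):
    """Sample anchor points in the latent space and linearly interpolate
    between consecutive anchors, yielding a smooth trajectory.
    Each point along the trajectory maps to a drum break."""
    rng = np.random.default_rng(seed)
    anchors = rng.standard_normal((n_anchors, dim))
    trajectory = []
    for a, b in zip(anchors[:-1], anchors[1:]):
        for t in np.linspace(0.0, 1.0, steps_between, endpoint=False):
            trajectory.append((1.0 - t) * a + t * b)  # linear interpolation
    return np.stack(trajectory)

zs = random_latent_walk()
# audio = np.concatenate([generator(z) for z in zs])  # hypothetical WaveGAN generator
```

Because adjacent latent points are close, the generated loops change gradually, which is what makes the 6-hour mix sound continuous.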

The pace of progress in synthetic audio and image generation is mind-blowing. Will we be able to generate infinite movies? Imagine an infinite Harry Potter story or an endless New Year's speech by Putin 😅


▶️ A 6-hour Neural Funk on YouTube
🎧 A 6-hour sequence in wav format

📓Colab notebook with pretrained models
​​Interview with Natalia Neverova - Research Lead at Facebook AI Research

Natalia Neverova was one of my research advisors during my internship at Facebook AI Research. In this interview, she talks about research at FAIR, which students they prefer to hire, and 3D reconstruction of people and animals (3D animals 🐒 were exactly my research project at FAIR).

🌐 Link to the interview (unfortunately, only in Russian)
China trains a 10-billion-parameter multimodal network… using NVIDIA’s code:

A hybrid team of researchers from Alibaba and Tsinghua University has built M6, a “Multi-Modality to Multi-Modality Multitask Mega-transformer”. M6 is a multi-modal model trained on a huge corpus of text and image data, including image-text pairs (similar to recent systems like OpenAI’s CLIP). M6 has a broad capability surface: because of how it was trained, you can use it to search for an image with text (or vice versa), generate media in different modalities, match images together, write poems, answer questions, and so on.

📦 Data: ~60 million images (with accompanying text pairs) totalling 1.9TB (almost twice the raw size of ImageNet), plus 292GB of text.
📌 Facts and figures: Though the authors say they’ve trained 10-billion and 100-billion parameter models, they mostly report performance statistics for the 10-billion one. The 100B model is a mixture-of-experts model, while the 10B one is based on NVIDIA’s Megatron training code. The model’s size and sophistication are notable – this feels like a symptom of the maturing capabilities of various Chinese AI organizations. I wonder when we’ll get an M6-scale system from people affiliated with India, or regions like Europe or Africa.

🤷🏼‍♂️ Why this matters: M6 is notable for being a non-English model at equivalent scale to some of the largest primarily-English ones. We’re entering an era where there will be multiple, gigantic AI models, with variations stemming from the organizations that trained them. It’s also interesting to consider how these models proliferate, and who will get access to them. Will students and researchers at Tsinghua get access to M6, or just Alibaba’s researchers, or both? And how might access schemes develop in other countries, as well?

🌀 A word about bias: There’s no discussion of bias in the paper (or ethics), which isn’t typical for papers of this type but is typical of papers that come out of Chinese research organizations 😉

📝 ArXiv Paper link


Source: https://jack-clark.net/
The results, honestly, are quite good. I especially enjoyed the humble opinion about "The Great Wall" 😄
We are on the eve of the Matrix: constantly elevated dopamine levels in the VR world, or poverty and fighting robots in reality.

Scientists from the University of Helsinki used GANs to create personalized attractive faces. To gradually increase facial attractiveness, they recorded the electrical activity of the subject's brain while changing the synthetic faces via a random walk in the GAN's latent space. This way, we get a GAN in which a living person acts as the discriminator, and therefore the generated faces become more likable for that person.
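The human-in-the-loop idea can be sketched as a simple preference-guided walk in latent space. This is a toy sketch, not the authors' method: the `score` function stands in for the EEG-decoded preference signal, and the hill-climbing acceptance rule replaces their actual update scheme.

```python
import numpy as np

def preference_guided_walk(score, dim=512, steps=200, step_size=0.1, seed=0):
    """Random walk in a GAN latent space, keeping a step only when the
    'human discriminator' (here an arbitrary score function standing in
    for EEG-decoded preference) rates the new face higher."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(dim)
    best = score(z)
    for _ in range(steps):
        cand = z + step_size * rng.standard_normal(dim)
        s = score(cand)
        if s > best:          # keep moves the person "likes" more
            z, best = cand, s
    return z, best

# toy preference: prefer latents close to a hidden 'ideal' point
ideal = np.zeros(512)
z_star, s_star = preference_guided_walk(lambda z: -np.linalg.norm(z - ideal))
```

Decoding `z_star` with the face generator would then give the most "likable" face found for this particular person.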

I thought about a similar idea a couple of years ago. We could analyze users' preferences in male/female appearance from their likes on social media and then use them to generate personalized ads with the faces of the most attractive people. This seems like a more feasible scenario than using brain encephalograms 🧠.
Now imagine that, with the help of such techniques, one could create an ideal virtual partner. To go even further, think about how personalized porn could be created with the face/appearance of the most attractive person (who may not even exist).

The terrible new world is almost ready 😅.

📝 Paper
🌐 Blogpost
Learning High Fidelity Depths of Dressed Humans by Watching TikTok Dance Videos

The single-frame depth estimate is refined in a self-supervised manner by leveraging local transformations of body parts to enforce geometric consistency across different poses.
First, a depth and normal estimation network is pretrained on synthetic 3D data (RenderPeople). This network is then refined using geometric consistency between pairs of different frames: each body part's motion is modeled independently as a rigid transformation, so the estimated 3D coordinates of points on a body part can be warped onto a different frame, and the resulting disparity is used as a loss function.

📝 Paper
🛠 Code (will be released soon)
NeX: Real-time View Synthesis with Neural Basis Expansion

An amazing new approach to novel view synthesis: a combination of multiplane images (MPI) and neural basis expansion (NeRF-like networks). It can reproduce spectacularly complex view-dependent effects (see the video).

Unlike a traditional MPI, which uses a set of simple RGBα planes, this technique models view-dependent effects by parameterizing each pixel as a linear combination of basis functions learned by a neural network.
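The per-pixel parameterization boils down to a dot product. A minimal sketch, with illustrative shapes (the real model also stores per-plane alpha and uses a specific number of basis functions): the basis values depend only on the viewing direction, so they are computed once per frame by a small MLP, while the per-pixel coefficients are precomputed, which is what makes real-time rendering possible.

```python
import numpy as np

def nex_pixel_color(k0, coeffs, basis):
    """NeX-style view-dependent color for one pixel:
    k0:     (3,)   view-independent base color
    coeffs: (N, 3) learned per-pixel reflectance coefficients
    basis:  (N,)   global basis values H_n(view_dir) from a small MLP
    Returns the (3,) RGB color for the current view."""
    return k0 + basis @ coeffs  # base color + linear basis combination

k0 = np.array([0.5, 0.5, 0.5])
coeffs = np.zeros((8, 3))   # all-zero coefficients -> no view dependence
basis = np.ones(8)
color = nex_pixel_color(k0, coeffs, basis)
```

With zero coefficients the pixel is purely diffuse; non-zero coefficients add reflections and highlights that shift with the viewing direction.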

It is stunningly fast to render: the first real-time neural rendering, running at 60 FPS, about 1000x faster than NeRF.
However, training NeX still takes a long time and may require a higher number of input views to replicate view-dependent effects.


By the way, this is the first paper I have seen from Thailand!

📝 Paper
▶️ Video from authors
🌐 Project page
🛠 Code will come soon
Unsupervised Semantic Segmentation by Contrasting Object Mask Proposals
ETH, Luc Van Gool

TL;DR is below ⬇️

📝 Arxiv
🛠 Code
Forwarded from Self Supervised Boy
Yet another simple approach leading to unsupervised segmentation. Mostly useful as pre-training, though.

The proposed pipeline first mines salient object areas (with any available framework, possibly a supervised one) and then applies contrastive learning to pixel embeddings inside those regions. During the second step, each individual pixel embedding is attracted to the mean embedding of its object and pushed away from the mean embeddings of other objects. This detail distinguishes it from some previously proposed pipelines and allows training at a larger scale, because the number of loss pairs grows more slowly.
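The attract/repel step can be sketched as an InfoNCE-style loss against region prototypes. This is a simplified stand-in for the paper's loss; the temperature and cosine similarity are common choices, not necessarily the authors' exact ones. Using region means instead of all pixel pairs keeps the number of negatives equal to the number of objects rather than the number of pixels.

```python
import numpy as np

def pixel_prototype_loss(pix_emb, own_mean, other_means, temp=0.1):
    """Contrastive loss for one pixel embedding: attracted to the mean
    embedding of its own object mask, repelled from the mean embeddings
    of other objects (InfoNCE-style stand-in for the paper's loss)."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    pos = np.exp(cos(pix_emb, own_mean) / temp)
    negs = sum(np.exp(cos(pix_emb, m) / temp) for m in other_means)
    return -np.log(pos / (pos + negs))
```

A pixel whose embedding already aligns with its own object's mean incurs near-zero loss; one that aligns with another object's mean is pushed back hard.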

Less briefly and with some external links here.
Source here.