Gradient Dude
2.35K subscribers
180 photos
50 videos
2 files
169 links
TL;DR for DL/CV/ML/AI papers from an author of publications at top-tier AI conferences (CVPR, NIPS, ICCV, ECCV).

Most ML feeds go for fluff; we go for the real meat.

YouTube: youtube.com/c/gradientdude
IG instagram.com/gradientdude
🦿Avatars Grow Legs

I'm thrilled to share with you my latest research paper (CVPR 2023)! This was a joint effort with my intern at Meta Reality Labs before our team transitioned to GenAI.

Our method, dubbed Avatars Grow Legs (AGRoL), aims to control a 3D avatar's entire body in VR without the need for extra sensors. Typically in VR, your input is limited to a headset and two handheld controllers, leaving out any direct signal from your legs. This limitation persists despite the Quest's downward-facing cameras, as they rarely capture an unoccluded view of the legs.

To tackle this, we've introduced a novel solution based on a diffusion model. Our model synthesizes the 3D movement of the whole body conditioned solely on the tracking data from the hands and the head, circumventing the need for direct observation of the legs.
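To make the conditioning concrete, here is a minimal sketch in plain NumPy; the dimensions are made up and random weights stand in for a trained network, so this is not the actual AGRoL architecture, just the idea of a denoiser that predicts full-body motion from a noisy body pose plus the head/wrist tracking signal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not the real AGRoL config).
SPARSE_DIM = 3 * 18   # head + 2 wrists, 18 features each
BODY_DIM = 22 * 6     # full-body pose as 6D rotations for 22 joints
HIDDEN = 256

# Random weights stand in for a trained network.
W_in = rng.normal(0, 0.02, (SPARSE_DIM + BODY_DIM + 1, HIDDEN))
W_out = rng.normal(0, 0.02, (HIDDEN, BODY_DIM))

def denoise_step(noisy_body, sparse_tracking, t):
    """One denoiser call: predict the clean full-body pose from the noisy
    body pose, the head/wrist tracking signal, and the diffusion step t."""
    x = np.concatenate([noisy_body, sparse_tracking, [t]])
    h = np.maximum(0.0, x @ W_in)   # single ReLU MLP layer
    return h @ W_out                # predicted clean body pose

body = rng.normal(size=BODY_DIM)        # noisy sample x_t
tracking = rng.normal(size=SPARSE_DIM)  # observed head + wrist signal
pred = denoise_step(body, tracking, t=0.5)
print(pred.shape)  # (132,)
```

The key point is simply that the legs never appear as an input: the network only ever sees the three tracked signals and learns to hallucinate a plausible lower body consistent with them.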

Moreover, we've designed AGRoL with an efficient architecture, enabling 30 FPS synthesis on a V100 GPU.

❱❱ Code and weights
❱❱ Paper
❱❱ Project Page

@gradientdude
Demo of our Avatars Grow Legs model that synthesizes the full 3D body motion based on sparse tracking inputs from the head and the wrists.

More details are in the paper.

@gradientdude
Today I will be presenting our CVPR2023 poster "Avatars Grow Legs: Generating Smooth Human Motion from Sparse Tracking Inputs with Diffusion Model".

Learn how to synthesize full-body motion based on only 3 known points (head and wrists)!

❱❱ Detailed post about the paper.

Come chat with me today, 10:30-12:30 PDT at poster #46, if you are at CVPR.

@gradientdude
Staff Research Scientist: Personal Update

I have some exciting news that I'd like to share with you! On Monday, I was promoted to E6, which means I am now a Staff Research Scientist at Meta GenAI.

This was made possible thanks to the significant impact and scope of a Generative AI project that I proposed, led, and completed last year. The project is not yet public, so I can't share details about it right now.

Before this, I was at the terminal level, Senior Research Scientist, a position many get stuck in forever. It takes extra effort and the right personal qualities to break out of this limbo and become Staff. But now I've unlocked a new ladder, E6+, where leveling up is significantly harder than between the Junior (E3) and Senior (E5) levels. However, this also presents a challenge and an opportunity for further development!

Exciting stuff!

@gradientdude
I'm getting back to juicy posts in English!

@gradientdude
⚡️SD3-Turbo: Fast High-Resolution Image Synthesis with Latent Adversarial Diffusion Distillation

Following Stable Diffusion 3, my ex-colleagues have published a preprint on distilling SD3 down to 4 sampling steps while maintaining quality.

The new method, Latent Adversarial Diffusion Distillation (LADD), is similar to ADD (see the post about it in @ai_newz), but with a number of differences:

↪️ Both the teacher and the student use the Transformer-based SD3 architecture here.
The biggest and best model has 8B parameters.

↪️ Instead of a DINOv2 discriminator working on RGB pixels, this paper suggests going back to a latent-space discriminator, which runs faster and burns less memory.

↪️ A copy of the teacher is taken as the discriminator backbone (i.e. the discriminator is trained generatively instead of discriminatively, as in the case of DINO). After each attention block, a discriminator head with 2D conv layers that classifies real/fake is added. This way the discriminator looks not only at the final output but at all the intermediate features, which strengthens the training signal.
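A rough sketch of this multi-head idea, with tiny linear heads in NumPy instead of the paper's 2D conv heads and made-up feature sizes, showing a hinge discriminator loss aggregated over per-block feature heads:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical intermediate feature sizes after each attention block.
FEATURE_DIMS = [64, 128, 256]

# One tiny linear "head" per block; the paper uses small 2D conv heads.
heads = [rng.normal(0, 0.1, d) for d in FEATURE_DIMS]

def discriminator_logits(features):
    """Score real/fake from every intermediate feature, not just the output."""
    return [float(f @ h) for f, h in zip(features, heads)]

def hinge_d_loss(real_feats, fake_feats):
    """Hinge GAN discriminator loss summed over all per-block heads."""
    loss = 0.0
    for r, f in zip(discriminator_logits(real_feats),
                    discriminator_logits(fake_feats)):
        loss += max(0.0, 1.0 - r) + max(0.0, 1.0 + f)
    return loss

real = [rng.normal(size=d) for d in FEATURE_DIMS]
fake = [rng.normal(size=d) for d in FEATURE_DIMS]
print(hinge_d_loss(real, fake) >= 0.0)  # True
```

Because every block contributes a term to the loss, the generator receives gradient from all depths of the teacher copy, which is what "strengthens the training signal" means in practice.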

↪️ Trained on images with different aspect ratios, rather than just 1:1 squares.

↪️ They removed the L2 reconstruction loss between the teacher's and the student's outputs. They claim a plain discriminator loss is enough if you choose the sampling distribution of the timesteps t wisely.

↪️ During training, they more frequently sample timesteps t with more noise, so that the student learns to generate the global structure of objects better.
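One illustrative way to bias sampling toward high-noise steps is a shifted logit-normal distribution over t; the exact choice of distribution here is my assumption for the sketch, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_timesteps(n, shift=1.0):
    """Logit-normal sampling of t in (0, 1). A positive shift of the mean
    skews samples toward t close to 1 (more noise), so the student sees
    high-noise steps more often and learns global structure first."""
    u = rng.normal(loc=shift, scale=1.0, size=n)
    return 1.0 / (1.0 + np.exp(-u))   # sigmoid maps to (0, 1)

t = sample_timesteps(10_000, shift=1.0)
print(t.min() > 0.0 and t.max() < 1.0)  # True
print(t.mean() > 0.5)                   # True: skewed toward high noise
```

With shift=0 this reduces to a symmetric distribution around t=0.5; increasing the shift concentrates training on the steps where the global layout of the image is decided.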

↪️ Distillation is performed on synthetic data generated by the teacher, rather than on photos from a dataset, as was the case in ADD.

It's also shown that DPO-LoRA tuning is a nice way to further improve the quality of the student's generations.

So, we get an SD3-Turbo model producing nice pics in just 4 steps. According to a small human eval (conducted on only 128 prompts), the student is comparable to the teacher in terms of image quality, but the student's prompt alignment is inferior, which is expected.

📖 Paper

@gradientdude
The paper also shows that SD3-Turbo is better than Midjourney 6 in both image quality and prompt alignment, which is surprising 🫥. Looking forward to the weights release to do a reality check!

@gradientdude
Suno v3 - The best txt2music model
Recently released Suno v3 is the absolute best txt2music and txt2audio model ever.

Suno v3 is capable of generating actually interesting 2-minute songs in one go (or even potentially indefinitely long ones with the continue function). And yes, actual songs! Because it also generates vocals, which have been greatly upgraded in this version. To put it in perspective, Suno v3 is now at the level of Midjourney v3: beautiful, but with some quirks.

The release of Suno v3 is like the rise of the first txt2img models (e.g. LDM). At first, everyone typed random ideas into the prompt and was amazed at how beautiful the results were. Then we wanted to understand how to make the result not just beautiful, but controlled the way we want. All kinds of PDFs and GitHub repos with prompting guides appeared. It's the same with Suno - one needs to know how to prompt it.

@gradientdude
🎸 Suno v3 prompt engineering guide

Go to the app homepage, Create tab. There's a simple mode (which will generate a song and lyrics, but without the tricks below) and a custom mode with more control. We pick the second, of course. Now we see a prompt window and a lyrics window.

1. Workflow.
The first generation is a max of 2 minutes. Usually, it can include an intro, verse, and chorus (maybe more if the tempo is high). Then click "continue"; that gives about one more minute: another verse and/or chorus.

You can do this in a number of ways. Here is my favourite way:
1. Insert the prompt and all of the lyrics.
2. Hit continue from this track. Cut out all the lyrics that have already been sung and generate again. Optionally, you can move the splice with "continue from" to the end of the previous verse/chorus and/or change the prompt for the new part.
3. Repeat step 2 until you run out of words.
4. Get Whole Song — *click*
5. Sign up for onerpm, generate cover art, insert lyrics and in two weeks your track is on all streaming platforms 🤭

2. The Prompt template.

A combo that works best is:

(Genres), (description of mood/tempo/idea), (some specific instruments, details).
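As a trivial illustration, the template can be assembled like this; the helper function and the example values are mine, not from Suno:

```python
def suno_prompt(genres, mood, details):
    """Build a style prompt following the template:
    (Genres), (description of mood/tempo/idea), (specific instruments, details)."""
    return ", ".join([genres, mood, details])

print(suno_prompt("synthwave, dream pop",
                  "melancholic mid-tempo night drive",
                  "analog synth pads, gated reverb drums"))
# synthwave, dream pop, melancholic mid-tempo night drive, analog synth pads, gated reverb drums
```

The point of the ordering is that the genre tokens carry the most weight, so they go first; the mood and instrument details then steer within that genre.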

3. Metatags are our bread and butter!
Metatags are the instructions inside [ ] in the lyrics window. They tell Suno what to do. Metatags are a field for experimentation; they may or may not work. You can write anything you can think of there.

Here are a couple of ideas.

The standard structure of a pop track looks like this (you can get by without it, but then a chunk of a verse may slide into the chorus):

[Intro]
[Verse 1]
[Pre-chorus]
[Chorus]
[Bridge] - can be inserted anywhere, there are also [guitar solo] or [percussion break] options.
[Verse 2]
[Pre-chorus]
[Chorus]
[Outro]
[End] - without it, the track may not even finish.

- singing style
[Soft female singing]
[Hyper-aggressive lead guitar solo] - yeah, yeah, instruments too.
[Epic chorus]
[Rap]

- [instrumental] so that Suno doesn't hallucinate lyrics on its own.

- You can try to spell out the part of some instrument, lol
[Percussion Break]
. . ! . . ! . . ! - did you recognize it?

[sad trombone]
waah-Waah-WaAaH.

4. (text)
Parentheses for backing vocals, choirs, and other stuff.

5. [Solo Vocals], [Lead Vocalist], etc.
By default, Suno likes vocal doubles and choirs, but the quality and intelligibility of the words suffer, so these tags are highly recommended.

6. Emphasis.
Time to remember second grade 😄 All for the sake of controlling pronunciation, intonation, and rhythmic accents.

Here are the symbols to copy and paste:
Á É Í Ó Ú Ý
á é í ó ú ý

You might not need them but in certain situations they will help.

7. Getting Inspired.
If you like some song from the top chart, you can continue from any point of it and add your own lyrics.

8. Suno v3 is smarter than you think.
Sometimes it's better to give it more freedom. And sometimes (often) it will straight up ignore your not-so-successful creative ideas.

There you go. Remember, the trial-and-error method led humans to dominance. The same idea can be applied to working with neural networks. You can also learn how to generate nice songs!

Suno's app: https://app.suno.ai/
Here's also a link to a playlist with cherry picks.

#tutorial
@gradientdude