Guys, I did some rebranding and also opened a channel in Instagram where I will post more high level information about new papers and research.
https://www.instagram.com/gradientdude/ subscribe!
https://www.instagram.com/gradientdude/ subscribe!
Jukebox: A Generative Model for Music 🎶
🌐 https://openai.com/blog/jukebox
💻 Google Colab: https://colab.research.google.com/github/openai/jukebox/blob/master/jukebox/Interacting_with_Jukebox.ipynb
OpenAI created a neural network that can generate music. Amazing breakthrough!
It models music directly as raw audio and produces voice as well.
❓ Challenges
- Existing symbolic generators have limitations—they cannot capture human voices or many of the more subtle timbres, dynamics, and expressivity that are essential to music.
- Sequences are very long. We have to deal wth extremely long-range dependencies
- A typical 4-minute song at CD quality (44 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game
- Previous work on MuseNet synthesized music based on large amounts of MIDI data.
✏️ Method
- Based on Vector Quantised-Variational AutoEncoders [VQ-VAE] (NeurIPS 2017) and VQ-VAE-2 (NeurIPS 2019)
https://papers.nips.cc/paper/7210-neural-discrete-representation-learning.pdf
https://arxiv.org/pdf/1906.00446.pdf
- Hierarchical VQ-VAEs (NIPS 2018)
https://arxiv.org/abs/1806.10474
- three levels in our VQ-VAE, shown below, which compress the 44kHz raw audio by 8x, 32x, and 128x, respectively, with a codebook size of 2048 for each level.
- Generating codes using transformers. Sparse Transformers as the learned priors for VQ-VAEs.
- 3 levels of priors: a top-level prior that generates the most compressed codes, and two upsampling priors that generate less compressed codes conditioned on above.
Learning Music Priors and Upsamplers:
- [Sparse Transformers](https://openai.com/blog/sparse-transformer/) as the learned priors for VQ-VAEs.
Conditional music generation:
- The top-level transformer is trained on the task of predicting compressed audio tokens conditioned on artist and genre.
- Lyrics conditioning using an extra encoder to produce a representation for the lyrics and attention layers that use queries from the music decoder to attend to keys and values from the lyrics encoder.
🗃️ Dataset
- Colected a new dataset of 1.2 million songs (600,000 of which are in English), paired with the corresponding lyrics and metadata from LyricWiki.
✔️ Results and Limitations
- There is still a significant gap between these generations and human-created music.
- Local musical coherence, traditional chord patterns, impressive solos.
- No choruses that repeat.
- Downsampling and upsampling process introduces discernable noise
- Very slow to sample (because of the autoregressive structure). ~9 hours to render 1 min audio.
- currently trained on English lyrics and mostly Western music.
- A set of 10 musicians from various genres were given an early access to JukeBox Tool to discuss their feedback on this work. While Jukebox is an interesting research result, these musicians did not find it immediately applicable to their creative process given some of its current limitations.
🔮 Future work
- Speed improvement (e.g., via model distillation)
- Reduce noise, improve quality.
- Conditioning on MIDI files.
🌐 https://openai.com/blog/jukebox
💻 Google Colab: https://colab.research.google.com/github/openai/jukebox/blob/master/jukebox/Interacting_with_Jukebox.ipynb
OpenAI created a neural network that can generate music. Amazing breakthrough!
It models music directly as raw audio and produces voice as well.
❓ Challenges
- Existing symbolic generators have limitations—they cannot capture human voices or many of the more subtle timbres, dynamics, and expressivity that are essential to music.
- Sequences are very long. We have to deal wth extremely long-range dependencies
- A typical 4-minute song at CD quality (44 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game
- Previous work on MuseNet synthesized music based on large amounts of MIDI data.
✏️ Method
- Based on Vector Quantised-Variational AutoEncoders [VQ-VAE] (NeurIPS 2017) and VQ-VAE-2 (NeurIPS 2019)
https://papers.nips.cc/paper/7210-neural-discrete-representation-learning.pdf
https://arxiv.org/pdf/1906.00446.pdf
- Hierarchical VQ-VAEs (NIPS 2018)
https://arxiv.org/abs/1806.10474
- three levels in our VQ-VAE, shown below, which compress the 44kHz raw audio by 8x, 32x, and 128x, respectively, with a codebook size of 2048 for each level.
- Generating codes using transformers. Sparse Transformers as the learned priors for VQ-VAEs.
- 3 levels of priors: a top-level prior that generates the most compressed codes, and two upsampling priors that generate less compressed codes conditioned on above.
Learning Music Priors and Upsamplers:
- [Sparse Transformers](https://openai.com/blog/sparse-transformer/) as the learned priors for VQ-VAEs.
Conditional music generation:
- The top-level transformer is trained on the task of predicting compressed audio tokens conditioned on artist and genre.
- Lyrics conditioning using an extra encoder to produce a representation for the lyrics and attention layers that use queries from the music decoder to attend to keys and values from the lyrics encoder.
🗃️ Dataset
- Colected a new dataset of 1.2 million songs (600,000 of which are in English), paired with the corresponding lyrics and metadata from LyricWiki.
✔️ Results and Limitations
- There is still a significant gap between these generations and human-created music.
- Local musical coherence, traditional chord patterns, impressive solos.
- No choruses that repeat.
- Downsampling and upsampling process introduces discernable noise
- Very slow to sample (because of the autoregressive structure). ~9 hours to render 1 min audio.
- currently trained on English lyrics and mostly Western music.
- A set of 10 musicians from various genres were given an early access to JukeBox Tool to discuss their feedback on this work. While Jukebox is an interesting research result, these musicians did not find it immediately applicable to their creative process given some of its current limitations.
🔮 Future work
- Speed improvement (e.g., via model distillation)
- Reduce noise, improve quality.
- Conditioning on MIDI files.
Openai
Jukebox
We’re introducing Jukebox, a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. We’re releasing the model weights and code, along with a tool to explore the generated samples.
3D Menagerie: Modeling the 3D Shape and Pose of Animal, Zuffi et al, CVPR 2017.
https://arxiv.org/abs/1611.07700
❗The authors describe a method to create a realistic 3D model of animals and to fit this model to 2D images.
✏️ Main contribution:
- Global/Local Stitched Shape model (GLoSS) which aligns a template mesh to different shapes, providing a coarse registration between very different animals.
- Multi-Animal Linear model (SMAL) which provides a shape space of animals trained from 41 scans
- the model generalizes to new animals not seen in training
- one can fit SMAL to 2D data using detected keypoints and binary segmentations
- SMAL can generate realistic animal shapes in a variety of poses.
Authors showed that starting with toys' 3D scans, we can learn a model that generalizes to images of real animals as well as to types of animals not seen during training.
The proposed parametric SMAL model is differentiable and can be fit to the data using gradient-based algorithms.
🌐 My blog post describing the method in more details https://gdude.de/blog/2020-08-01/SMAL-CVPR2017
https://arxiv.org/abs/1611.07700
❗The authors describe a method to create a realistic 3D model of animals and to fit this model to 2D images.
✏️ Main contribution:
- Global/Local Stitched Shape model (GLoSS) which aligns a template mesh to different shapes, providing a coarse registration between very different animals.
- Multi-Animal Linear model (SMAL) which provides a shape space of animals trained from 41 scans
- the model generalizes to new animals not seen in training
- one can fit SMAL to 2D data using detected keypoints and binary segmentations
- SMAL can generate realistic animal shapes in a variety of poses.
Authors showed that starting with toys' 3D scans, we can learn a model that generalizes to images of real animals as well as to types of animals not seen during training.
The proposed parametric SMAL model is differentiable and can be fit to the data using gradient-based algorithms.
🌐 My blog post describing the method in more details https://gdude.de/blog/2020-08-01/SMAL-CVPR2017
Gradient Dude
Multi-Animal Linear model (SMAL): Modeling the 3D Shape and Pose of Animals
“3D Menagerie: Modeling the 3D Shape and Pose of Animal”, Zuffi et al, CVPR 2017.
This media is not supported in your browser
VIEW IN TELEGRAM
I'm really into neural art. Oh boy, how amazing van Gogh is in VR! You can get an immersive experience if you follow the link https://static.kuula.io/share/79QMS