Jukebox: A Generative Model for Music 🎶
🌐 https://openai.com/blog/jukebox
💻 Google Colab: https://colab.research.google.com/github/openai/jukebox/blob/master/jukebox/Interacting_with_Jukebox.ipynb
OpenAI created a neural network that can generate music. Amazing breakthrough!
It models music directly as raw audio and produces voice as well.
❓ Challenges
- Existing symbolic generators have limitations—they cannot capture human voices or many of the more subtle timbres, dynamics, and expressivity that are essential to music.
- Sequences are very long, so we have to deal with extremely long-range dependencies.
- A typical 4-minute song at CD quality (44 kHz, 16-bit) has over 10 million timesteps. For comparison, GPT-2 had 1,000 timesteps and OpenAI Five took tens of thousands of timesteps per game
- Previous work on MuseNet synthesized music based on large amounts of MIDI data.
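To make the sequence-length bullet concrete, here is a minimal arithmetic sketch. It assumes the exact CD sample rate of 44,100 Hz (the post rounds it to 44 kHz) and a single audio channel; the constant and function names are just for illustration.

```python
# Assumption: exact CD sample rate; the post rounds this to 44 kHz.
SAMPLE_RATE_HZ = 44_100
GPT2_CONTEXT = 1_000  # timesteps quoted in the post for GPT-2


def num_timesteps(seconds: int, sample_rate: int = SAMPLE_RATE_HZ) -> int:
    """Number of raw audio samples (timesteps) in a mono clip."""
    return seconds * sample_rate


song_timesteps = num_timesteps(4 * 60)  # a typical 4-minute song
print(song_timesteps)                   # 10584000
print(song_timesteps // GPT2_CONTEXT)   # how many GPT-2-sized contexts that is
```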
✏️ Method
- Based on Vector Quantised-Variational AutoEncoders [VQ-VAE] (NeurIPS 2017) and VQ-VAE-2 (NeurIPS 2019)
https://papers.nips.cc/paper/7210-neural-discrete-representation-learning.pdf
https://arxiv.org/pdf/1906.00446.pdf
- Hierarchical VQ-VAEs (NIPS 2018)
https://arxiv.org/abs/1806.10474
- Three levels in the VQ-VAE compress the 44 kHz raw audio by 8x, 32x, and 128x, respectively, with a codebook size of 2048 for each level.
- Generating codes using transformers. Sparse Transformers as the learned priors for VQ-VAEs.
- 3 levels of priors: a top-level prior that generates the most compressed codes, and two upsampling priors that generate less compressed codes conditioned on above.
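The quantization step at the heart of a VQ-VAE can be sketched as a nearest-neighbor lookup into the codebook. This is only an illustration in NumPy, not the Jukebox code: the embedding width of 64 and the function names are assumptions, while the codebook size (2048) and the compression factors (8x, 32x, 128x) come from the list above.

```python
import numpy as np


def quantize(latents: np.ndarray, codebook: np.ndarray):
    """Replace each latent vector with its nearest codebook entry.

    latents:  (T, D) encoder outputs
    codebook: (K, D) learned code vectors
    Returns the integer codes (T,) and the quantized vectors (T, D).
    """
    # Squared Euclidean distance between every latent and every code: (T, K)
    dists = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    codes = dists.argmin(axis=1)
    return codes, codebook[codes]


def compressed_length(num_samples: int, factor: int) -> int:
    """Number of discrete tokens left after compressing by `factor`."""
    return num_samples // factor


rng = np.random.default_rng(0)
codebook = rng.normal(size=(2048, 64))  # 64 is an assumed embedding width
# Latents that sit very close to three known codebook entries.
latents = codebook[[5, 17, 2047]] + 0.01 * rng.normal(size=(3, 64))
codes, quantized = quantize(latents, codebook)
print(codes)

# Token counts for a 4-minute song (10,584,000 samples at 44.1 kHz).
for factor in (8, 32, 128):
    print(factor, compressed_length(10_584_000, factor))
```

The transformer priors then model these integer codes instead of raw samples, which is what makes the sequence lengths manageable.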
Learning Music Priors and Upsamplers:
- [Sparse Transformers](https://openai.com/blog/sparse-transformer/) as the learned priors for VQ-VAEs.
Conditional music generation:
- The top-level transformer is trained on the task of predicting compressed audio tokens conditioned on artist and genre.
- Lyrics conditioning using an extra encoder to produce a representation for the lyrics and attention layers that use queries from the music decoder to attend to keys and values from the lyrics encoder.
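The lyrics conditioning described above is a form of cross-attention: queries come from the music decoder, keys and values from the lyrics encoder. Below is a minimal single-head NumPy sketch of that idea. The shapes, the weight matrices, and the function names are assumptions for illustration and do not reproduce the actual Jukebox layers.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)  # for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(music_states, lyric_states, w_q, w_k, w_v):
    """Single-head attention from music positions to lyric positions."""
    q = music_states @ w_q  # (T_music, d) queries from the music decoder
    k = lyric_states @ w_k  # (T_lyrics, d) keys from the lyrics encoder
    v = lyric_states @ w_v  # (T_lyrics, d) values from the lyrics encoder
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each music step attends over lyrics
    return weights @ v, weights


rng = np.random.default_rng(0)
d = 8
music_states = rng.normal(size=(4, d))  # 4 music tokens
lyric_states = rng.normal(size=(6, d))  # 6 lyric tokens
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))
attended, weights = cross_attention(music_states, lyric_states, w_q, w_k, w_v)
print(attended.shape, weights.shape)
```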
🗃️ Dataset
- Collected a new dataset of 1.2 million songs (600,000 of which are in English), paired with the corresponding lyrics and metadata from LyricWiki.
✔️ Results and Limitations
- There is still a significant gap between these generations and human-created music.
- Local musical coherence, traditional chord patterns, impressive solos.
- No choruses that repeat.
- The downsampling and upsampling process introduces discernible noise.
- Very slow to sample (because of the autoregressive structure): ~9 hours to render 1 minute of audio.
- Currently trained on English lyrics and mostly Western music.
- A set of 10 musicians from various genres were given early access to the Jukebox tool to share their feedback on this work. While Jukebox is an interesting research result, these musicians did not find it immediately applicable to their creative process given some of its current limitations.
🔮 Future work
- Speed improvement (e.g., via model distillation)
- Reduce noise, improve quality.
- Conditioning on MIDI files.
Openai
Jukebox
We’re introducing Jukebox, a neural net that generates music, including rudimentary singing, as raw audio in a variety of genres and artist styles. We’re releasing the model weights and code, along with a tool to explore the generated samples.
3D Menagerie: Modeling the 3D Shape and Pose of Animals, Zuffi et al., CVPR 2017.
https://arxiv.org/abs/1611.07700
❗The authors describe a method to create a realistic 3D model of animals and to fit this model to 2D images.
✏️ Main contribution:
- Global/Local Stitched Shape model (GLoSS) which aligns a template mesh to different shapes, providing a coarse registration between very different animals.
- Skinned Multi-Animal Linear model (SMAL), which provides a shape space of animals trained from 41 scans
- the model generalizes to new animals not seen in training
- one can fit SMAL to 2D data using detected keypoints and binary segmentations
- SMAL can generate realistic animal shapes in a variety of poses.
The authors showed that, starting from 3D scans of toy animals, one can learn a model that generalizes to images of real animals as well as to types of animals not seen during training.
The proposed parametric SMAL model is differentiable and can be fit to the data using gradient-based algorithms.
🌐 My blog post describing the method in more detail: https://gdude.de/blog/2020-08-01/SMAL-CVPR2017
Gradient Dude
Multi-Animal Linear model (SMAL): Modeling the 3D Shape and Pose of Animals
“3D Menagerie: Modeling the 3D Shape and Pose of Animal”, Zuffi et al, CVPR 2017.
I'm really into neural art. Oh boy, how amazing van Gogh is in VR! You can get an immersive experience if you follow the link https://static.kuula.io/share/79QMS
Watch For Undead! Spooky undead images generated by fine-tuning StyleGAN2, by Twitter user @Norod78.
Image dataset and the model used for this can be found here https://mega.nz/folder/C0UBDYQY#v57wYcnhXQooj0C7acJyvA
A bunch of contrastive representation learning methods exist already, e.g. MoCo, SimCLR, BYOL, etc.
Here is another one - CLIM: Center-wise local image mixture for contrastive representation learning (ICLR 2021).
The main idea is to consider the semantic similarity between different images and incorporate it into the learning procedure, in contrast to the many contrastive learning methods that use only augmentations of the query image as positives. The main contribution is twofold:
a) partition the data into 10k clusters and use, as positive samples, the nearest neighbors from the same cluster that are closer to the centroid than the anchor;
b) use more complex augmentations, i.e. CutMix and multi-resolution, during training. The proposed method achieves state-of-the-art results for unsupervised learning on ImageNet and for transfer learning on Pascal VOC, COCO, and LVIS.
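The positive-sample rule in (a) can be sketched as follows. This is my reading of the one-line description above, not the authors' code: the Euclidean distance, the parameter `k`, and the function name `select_positives` are assumptions, and the cluster assignments are taken as given (e.g. from k-means).

```python
import numpy as np


def select_positives(features, cluster_ids, centroids, anchor_idx, k=2):
    """Pick up to k positives for an anchor.

    Candidates must be in the anchor's cluster AND closer to the cluster
    centroid than the anchor is; among those, the k nearest to the anchor
    are returned (as indices into `features`).
    """
    anchor = features[anchor_idx]
    centroid = centroids[cluster_ids[anchor_idx]]
    anchor_to_centroid = np.linalg.norm(anchor - centroid)

    # Other members of the same cluster.
    members = np.flatnonzero(cluster_ids == cluster_ids[anchor_idx])
    members = members[members != anchor_idx]

    # Keep only members that lie closer to the centroid than the anchor.
    to_centroid = np.linalg.norm(features[members] - centroid, axis=1)
    candidates = members[to_centroid < anchor_to_centroid]

    # Rank the remaining candidates by distance to the anchor.
    to_anchor = np.linalg.norm(features[candidates] - anchor, axis=1)
    return candidates[np.argsort(to_anchor)[:k]]


features = np.array([[3.0, 0.0], [2.0, 0.0], [1.0, 0.0], [4.0, 0.0], [2.5, 0.0]])
cluster_ids = np.array([0, 0, 0, 0, 1])
centroids = np.array([[0.0, 0.0], [2.5, 0.0]])
print(select_positives(features, cluster_ids, centroids, anchor_idx=0))
```

In the toy example, sample 3 is rejected because it is farther from the centroid than the anchor, and sample 4 is rejected because it belongs to another cluster.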
Let's talk a bit about object detectors.
But we want to have a single bounding box per object (not hundreds of them), right?
You probably know that in most detection pipelines there is a step called Non-Maximum suppression (NMS) which is responsible for this.
Its purpose is to take the many tentative detections proposed by the network and drop all spurious and highly overlapping ones, retaining only the single most confident bounding box per object.
The de facto standard approach for NMS is Greedy NMS. At each step, we select the most confident box and drop all others whose IoU with it is greater than some fixed threshold (often 0.5). We repeat this process until no proposals are left. But this approach is finicky and requires manual tweaking (which I personally hate). For example, if you set the threshold too low you may lose recall, since detections of very close objects would be considered duplicates and would be dropped. On the other hand, a higher threshold may leave you with too many spurious detections.
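Here is a minimal pure-Python sketch of Greedy NMS as described above. It assumes boxes in `(x1, y1, x2, y2)` format; the function names are mine, and production code would normally use a vectorized library routine instead.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def greedy_nms(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes kept by Greedy NMS, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)  # most confident remaining box
        keep.append(best)
        # Drop every remaining box that overlaps the kept one too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(greedy_nms(boxes, scores, iou_threshold=0.5))  # [0, 2]
print(greedy_nms(boxes, scores, iou_threshold=0.7))  # [0, 1, 2]
```

The first two boxes have an IoU of about 0.68, so the second one is suppressed at a threshold of 0.5 but survives at 0.7, which is exactly the sensitivity to the threshold that makes manual tuning annoying.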
Therefore researchers from Max Planck Institute for Informatics in Saarbrücken proposed a method for end-to-end learnable NMS. Now we train a neural network to do NMS instead of a greedy algorithm. Details are in the paper: 📰 "Learning non-maximum suppression"
I briefly summarize the proposed algorithm, but for more detail refer to the paper.
If the object is already assigned to one detection, all other detections with a high overlap (neighboring detections) should be notified about it and should decrease their scores. To do this, the paper proposes computing pairwise features between overlapping proposals (these features are handcrafted and include IoU, normalized distance in the X and Y directions, the difference in width and height, the aspect ratio difference, detection scores, etc.). These pairwise features are concatenated with the original detection features produced by the CNN backbone and are passed through a series of residual blocks with FC layers. Next, to assign only a single detection per object, the authors run a Hungarian matching algorithm between the ground truth and the detections and force all unmatched detections to decrease their scores. After that, the proposed network, called GNet, produces only a few boxes with very strong scores, and all other boxes are assigned very tiny ones.
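To give a feel for the handcrafted pairwise features mentioned above, here is a small sketch that computes a few of them for a pair of detections. The exact feature definitions and normalizations in the paper may differ; the ones below (offsets normalized by the first box's size, log ratios for size and aspect differences) and the name `pairwise_features` are my assumptions.

```python
import math


def pairwise_features(box_a, score_a, box_b, score_b):
    """Illustrative pairwise features for two (x1, y1, x2, y2) detections."""
    aw, ah = box_a[2] - box_a[0], box_a[3] - box_a[1]
    bw, bh = box_b[2] - box_b[0], box_b[3] - box_b[1]
    ax, ay = (box_a[0] + box_a[2]) / 2, (box_a[1] + box_a[3]) / 2
    bx, by = (box_b[0] + box_b[2]) / 2, (box_b[1] + box_b[3]) / 2

    # Intersection over union of the two boxes.
    iw = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    ih = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = iw * ih
    union = aw * ah + bw * bh - inter

    return {
        "iou": inter / union if union > 0 else 0.0,
        "dx": (bx - ax) / aw,               # center offset, normalized by width
        "dy": (by - ay) / ah,               # center offset, normalized by height
        "log_w_ratio": math.log(bw / aw),   # width difference (scale invariant)
        "log_h_ratio": math.log(bh / ah),   # height difference (scale invariant)
        "log_aspect_ratio": math.log((bw / bh) / (aw / ah)),
        "score_a": score_a,
        "score_b": score_b,
    }


feats = pairwise_features((0, 0, 10, 10), 0.9, (5, 0, 15, 10), 0.6)
print(feats)
```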
This is not a paper, but it is awesome!
The Polycam app uses the built-in LIDAR sensor in the latest iPad Pro to scan the surroundings and build a textured 3D mesh. The mesh is “generally accurate down to about one inch”. The process is also near-real-time: processing is done locally on the tablet, with single-room captures taking “only seconds to process”, making it possible to see the mesh building up as you walk around. Looks like it's one of the best 3D scanning apps out there (for arbitrary objects, see 3D sofa example here).
However, it relies on LIDAR, which we can find only in the latest iPad Pro and the upcoming iPhone 12 Pro. It would be much more exciting if they used pure RGB-based techniques, e.g. SLAM, which does not require a LIDAR or a depth camera. I will come back to this and will briefly discuss some techniques for building 3D shapes from images in future posts.
YouTube
Polycam preview
3D scanning with the Polycam app on iOS. Learn more at https://polycam.ai/
There is another cool app, in3D, created by my mates, that can build your 3D avatar from a 360° video capturing you from different angles. They achieve compelling results: the avatars capture fine shape details and are automatically rigged (see avatar example). The app is also currently available only for iPhones, but at least it does not require a LIDAR sensor 😅.
YouTube
in3D app: 3D body scanning with an iPhone
App: https://apple.co/2FcBZ7B
Web: http://in3d.io/
Scan yourself into GTA V, Second Life or VRChat.
Contact us: hello@in3d.io
Scientists from the University of Washington broke the longstanding record for approximately solving the notorious NP-hard Travelling Salesman Problem (TSP). This optimization problem, which seeks the shortest (or least expensive) round trip through a collection of cities, has applications ranging from DNA sequencing to ride-sharing logistics.
There had been no advancement in this field since 1976, when Nicos Christofides came up with an algorithm that efficiently finds approximate solutions: round trips that are at most 50% longer than the best round trip.
Funnily enough, the novel algorithm improves on the previous approximation algorithm by a whopping margin of 2.0 x 10^-36!!! (Yes, that is 0.2 billionth of a trillionth of a trillionth of a percent.) But please don't be too disappointed (although I was). This result breaks a theoretical and psychological barrier that persisted for more than forty years. And hopefully, it will spark the interest of the broader community in this problem and will lead to further advancements in the coming years. Moreover, it is likely (although not proven yet) that the proposed algorithm does much better than its predecessor in most cases and improves by at least that tiny margin in the worst case.
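A quick sanity check of that number, using exact rational arithmetic. The only assumption here is the reading of "trillion" as 10^12 and "billion" as 10^9; the variable names are mine.

```python
from fractions import Fraction

BILLIONTH = Fraction(1, 10**9)
TRILLIONTH = Fraction(1, 10**12)

# "0.2 billionth of a trillionth of a trillionth of a percent"
improvement_percent = Fraction(2, 10) * BILLIONTH * TRILLIONTH * TRILLIONTH
improvement_fraction = improvement_percent / 100  # percent -> plain fraction

print(improvement_fraction == Fraction(2, 10**36))  # matches 2.0 x 10^-36

# Christofides guarantees a tour at most 50% longer than optimal, i.e. a
# ratio of 1.5. The margin is far too small for double precision to notice:
float_cannot_see_it = (1.5 - 2e-36 == 1.5)
print(float_cannot_see_it)
```

So the two ways of stating the margin in the post agree, and the improvement is far below the resolution of ordinary floating-point arithmetic.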
As a Deep Learning evangelist, my first impression after reading the headline was that this was another victory for neural networks. However, I was wrong: not all the cool stuff is done with the help of NNs. The method is based on machinery called the geometry of polynomials, a little-known discipline in the theoretical computer science world.
We are living in incredible times! Maybe somebody will finally prove P = NP?
Quanta Magazine
Computer Scientists Break Traveling Salesperson Record
After 44 years, there’s finally a better way to find approximate solutions to the notoriously difficult traveling salesperson problem.