Facebook open-sourced a library for state-of-the-art self-supervised learning: VISSL.
+ It contains reproducible reference implementations of SOTA self-supervised learning approaches (SimCLR, MoCo, PIRL, SwAV, etc.) and their reusable components. It also supports supervised training.
+ It is easy to train models on a single GPU, multiple GPUs, or multiple nodes, with seamless scaling to large datasets and model sizes thanks to FP16, LARC, etc.
Finally, somebody has unified all the recent work in one modular framework. I don't know about you, but I'm very happy 😌!
VISSL: https://vissl.ai/
Blogpost: https://ai.facebook.com/blog/seer-the-start-of-a-more-powerful-flexible-and-accessible-era-for-computer-vision
Tutorials in Google Colab: https://vissl.ai/tutorials/
Self-supervised Pretraining of Visual Features in the Wild
Facebook also published its ultimate SElf-supERvised (SEER) model.
- They pretrained it on 1B random, unlabeled, and uncurated Instagram images 👀.
- SEER outperformed SOTA self-supervised systems, reaching 84.2% top-1 accuracy on ImageNet.
- SEER also outperformed SOTA supervised models on downstream tasks, including low-shot, object detection, segmentation, and image classification.
- When trained with just 10% of ImageNet, SEER still achieved 77.9% top-1 accuracy on the full dataset. When trained with just 1% of the annotated ImageNet examples, SEER achieved 60.5% top-1 accuracy.
- SEER is based on the recent RegNet architecture. Under comparable training settings and FLOPs, RegNet models outperform the popular EfficientNet models while being up to 5x faster on GPUs.
📝 Paper
📖 Blogpost
⚙️ I guess the source code will be published as a part of VISSL soon.
A synthesized StyleGAN2 portrait was edited according to a textual description using the CLIP encoder. A man was transformed into a vampire by navigating the latent space with the query "an image of a man resembling a vampire, with the face of Count Dracula". Video attached.
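The core idea can be sketched as follows: keep the StyleGAN2 generator frozen and optimize the latent code so that CLIP's embedding of the generated image matches the embedding of the text query. This is only a rough sketch of the idea, not the code from the notebook below: `generator` and `initial_latent` are hypothetical placeholders for a pretrained StyleGAN2 generator and the latent code of the original portrait, and the actual notebook may use different losses, regularizers, or latent spaces.

```python
# Rough sketch of CLIP-guided latent optimization (not the exact code from the
# linked notebook). `generator` and `initial_latent` are hypothetical placeholders.
import torch
import clip  # https://github.com/openai/CLIP

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)

text = clip.tokenize(
    ["an image of a man resembling a vampire, with the face of Count Dracula"]
).to(device)
text_features = clip_model.encode_text(text).detach()

# Start from the latent code of the original portrait and optimize it directly.
latent = initial_latent.clone().requires_grad_(True)  # hypothetical starting point
optimizer = torch.optim.Adam([latent], lr=0.01)

for step in range(300):
    image = generator(latent)  # hypothetical: (1, 3, H, W) image in [0, 1]
    image = torch.nn.functional.interpolate(image, size=224)  # CLIP input size
    # (A real implementation would also apply CLIP's mean/std normalization here.)
    image_features = clip_model.encode_image(image)
    # Maximize cosine similarity between the image and the text description.
    loss = -torch.cosine_similarity(image_features, text_features).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```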
For me this looks like sorcery ✨.
➖ Link to the source tweet
📓 Colab notebook
Barlow Twins: Self-Supervised Learning via Redundancy Reduction
Yann LeCun, FAIR
New self-supervised learning loss: compute the cross-correlation matrix between the features of two distorted versions of a sample and push it as close to the identity matrix as possible (see the sketch below).
+ This naturally avoids representation collapse and causes the representation vectors of distorted versions of a sample to be similar, while minimizing the redundancy between the components of these vectors.
+ It is also robust to the training batch size.
+ Comparable to SOTA self-supervised methods (similar results as BYOL), but the method is conceptually simpler.
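For illustration, here is a minimal PyTorch sketch of the objective. This is not the official implementation: the variable names and the off-diagonal weight (≈5e-3) are assumptions based on my reading of the paper, so check the official code once it is released.

```python
# Minimal sketch of the Barlow Twins objective, assuming z1 and z2 are the
# projector outputs for two augmented views of the same batch of images.
import torch

def barlow_twins_loss(z1: torch.Tensor, z2: torch.Tensor, lambd: float = 5e-3):
    n, d = z1.shape
    # Standardize each embedding dimension across the batch.
    z1 = (z1 - z1.mean(dim=0)) / (z1.std(dim=0) + 1e-6)
    z2 = (z2 - z2.mean(dim=0)) / (z2.std(dim=0) + 1e-6)
    # Cross-correlation matrix between the two views, shape (d, d).
    c = (z1.T @ z2) / n
    diag = torch.diagonal(c)
    on_diag = ((diag - 1) ** 2).sum()                # invariance: diagonal -> 1
    off_diag = ((c - torch.diag(diag)) ** 2).sum()   # redundancy reduction: off-diagonal -> 0
    return on_diag + lambd * off_diag

# Usage (hypothetical backbone/projector):
# loss = barlow_twins_loss(projector(backbone(view1)), projector(backbone(view2)))
```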
⚙️ My favorite part, training resources: 32x V100 GPUs, approx. 124 hours
📝 Paper
🛠 Code (will be released soon)
Self-supervised learning: The dark matter of intelligence
Blog post by Yann LeCun and Ishan Misra - well-known experts in self-supervised learning at FAIR.
They talk about:
- Self-supervised learning as a paradigm in general
- Self-supervised learning as predictive learning
- Self-supervised learning for language versus vision
- Modeling the uncertainty in prediction
- A unified view of self-supervised methods
- Self-supervised learning at Facebook
Some excerpts:
As babies, we learn how the world works largely by observation. We form generalized predictive models about objects in the world by learning concepts such as object permanence and gravity. Later in life, we observe the world, act on it, observe again, and build hypotheses to explain how our actions change our environment by trial and error.
We believe that self-supervised learning (SSL) is one of the most promising ways to build such background knowledge and approximate a form of common sense in AI systems.
📎 Read more here.
“Long term, progress in AI will come from programs that just watch videos all day and learn like a baby. ... Children learn by watching the spectacle of the world. But when the spectacle of the world is captured by a camera, it's a video.” - Yann LeCun
I can only add that AI might also learn from interacting with its environment (at least a simulated one).
Blogpost "Facebook's New AI Teaches Itself to See With Less Human Help" - a high-level reflection on self-supervised learning at wired.com.
Visualising Neurons in Artificial Neural Networks
What a surprise, OpenAI discovered yet again that neurons can be interpretable 😂 This time they showed multimodal neurons in their recently hyped CLIP model.
https://openai.com/blog/multimodal-neurons/
Regarding the typographic attack from the previous post: apparently, it can be avoided if you give a proper query text string. For example, “wait a second, this is just an apple with a label saying iPod” will get higher confidence than just “iPod”.
This was discovered by Yannic.
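If you want to try it yourself, here is a minimal sketch using OpenAI's released CLIP code. The image path and the set of prompts are placeholders; the point is just to compare how much probability CLIP assigns to each text query.

```python
# Comparing prompts with OpenAI's CLIP; "apple_with_ipod_label.jpg" is a
# placeholder for the typographic-attack image.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("apple_with_ipod_label.jpg")).unsqueeze(0).to(device)
prompts = [
    "an iPod",
    "an apple",
    "wait a second, this is just an apple with a label saying iPod",
]
text = clip.tokenize(prompts).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)[0].tolist()

for prompt, p in zip(prompts, probs):
    print(f"{p:.3f}  {prompt}")
```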
It is Sunday, pancake time 👌🏻. So I could not resist sharing this spectacular Deep Fake with you.
Neural Funk: AI generates endless breakbeats
Enthusiasts from Skoltech have trained a WaveGAN on 7500 vintage drum loops, then used the resulting model to generate thousands of new drum loops.
I have attached my favorite 6-minute sample (147 bpm). Love it!
The result was obtained by moving a point slowly through a random trajectory in the model’s latent space. Each point in the latent space corresponds to either an existing or non-existing break. Linear movement between two points results in a smooth transition between two corresponding breaks.
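For illustration, a minimal sketch of such a latent-space walk, assuming a hypothetical pretrained `wavegan_generator` and a 100-dimensional latent space (the actual notebook may differ):

```python
# Sketch of a "random trajectory through latent space": sample random waypoints
# and interpolate linearly between them; `wavegan_generator` is hypothetical.
import torch

latent_dim = 100          # assumed latent size
n_waypoints = 16          # random points defining the trajectory
steps_between = 32        # interpolation steps between consecutive waypoints

waypoints = torch.randn(n_waypoints, latent_dim)
clips = []
for a, b in zip(waypoints[:-1], waypoints[1:]):
    for t in torch.linspace(0.0, 1.0, steps_between):
        z = (1 - t) * a + t * b  # linear interpolation between two breaks
        with torch.no_grad():
            clips.append(wavegan_generator(z.unsqueeze(0)))  # one generated loop

audio = torch.cat(clips, dim=-1)  # concatenate loops into one long sequence
```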
The pace of progress in synthetic audio and image generation is mind-blowing. Will we be able to generate infinite movies? Imagine an infinite Harry Potter story or an endless New Year's speech of Putin 😅
▶️ A 6-hour Neural Funk on YouTube
🎧 A 6-hour sequence in wav format
📓 Colab notebook with pretrained models
Interview with Natalia Neverova - Research Lead at Facebook AI Research
Natalia Neverova was one of my research advisors during my internship at Facebook AI Research. In this interview, she talks about research at FAIR, which students they prefer to hire, and 3D reconstruction of people and animals (3D animals 🐒 was exactly my research project at FAIR).
🌐 Link to the interview (unfortunately, only in Russian)
Transferring Dense Pose to Proximal Animal Classes (CVPR 2020)
Frame-by-frame results produced by our model after self-training.
Project url: https://asanakoy.github.io/densepose-evolution/
China trains a 10-billion-parameter multimodal network… using NVIDIA's code:
A hybrid team of researchers from Alibaba and Tsinghua University have built M6, a "Multi-Modality to Multi-Modality Multitask Mega-transformer". M6 is a multi-modal model trained on a huge corpus of text and image data, including image-text pairs (similar to recent systems like OpenAI's CLIP). M6 has a broad capability surface: because of how it was trained, you can use it to search for images given a text query (and vice versa), generate media in different modalities, match images together, write poems, answer questions, and so on.
📦 Data: ~60 million images (with accompanying text pairs) totalling 1.9TB (almost twice the raw size of ImageNet), plus 292GB of text.
📌 Facts and figures: Though the authors say they've trained both a 10-billion and a 100-billion-parameter model, they mostly report performance statistics for the 10-billion one. The 100B model is a mixture-of-experts model, while the 10B one is based on NVIDIA's Megatron training code. The model's size and sophistication are notable; this feels like a symptom of the maturing capabilities of various Chinese AI organizations. I wonder when we'll get an M6-scale system from people affiliated with India, or regions like Europe or Africa.
🤷🏼♂️ Why this matters: M6 is notable for being a non-English model at equivalent scale to some of the largest primarily-English ones. We’re entering an era where there will be multiple, gigantic AI models, with variations stemming from the organizations that trained them. It’s also interesting to consider how these models proliferate, and who will get access to them. Will students and researchers at Tsinghua get access to M6, or just Alibaba’s researchers, or both? And how might access schemes develop in other countries, as well?
🌀 A word about bias: There’s no discussion of bias in the paper (or ethics), which isn’t typical for papers of this type but is typical of papers that come out of Chinese research organizations 😉
📝 ArXiv Paper link
—
Source: https://jack-clark.net/
The results, honestly, are quite good. Especially enjoyed the humble opinion about "The Great Wall" 😄