Finetuning Pretrained Transformers into RNNs
Microsoft+Deepmind+...
Transformers are the current SOTA in language modeling. But they come with significant computational overhead, as the attention mechanism scales quadratically in sequence length. The memory consumption also grows linearly as the sequence becomes longer. This bottleneck limits the usage of large-scale pretrained generation models, such as GPT-3 or Image Transformers.
Several efficient transformer variants have been proposed recently. For example, a linear-complexity recurrent variant has proven well suited for autoregressive generation. It approximates the softmax attention with randomized or heuristic feature maps, but it can be difficult to train or may yield suboptimal accuracy.
This work converts a pretrained transformer into its efficient linear-complexity recurrent counterpart with a learned feature map, improving efficiency while retaining accuracy. To achieve this, the authors replace the softmax attention in an off-the-shelf pretrained transformer with its linear-complexity recurrent alternative and then finetune.
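To make the "recurrent counterpart" idea concrete, here is a minimal pure-Python sketch of causal linear attention computed as an RNN. It is an illustration under assumptions, not the authors' code: the feature map is assumed to be a single ReLU layer phi(x) = relu(Wx + b), and all function names, shapes, and the random demo data are made up for this example. The recurrent version keeps a fixed-size state (S, z) instead of attending over all previous tokens, and a quadratic reference implementation is included to check that both give the same outputs.

```python
import random


def relu_feature_map(x, W, b):
    """phi(x) = relu(W x + b). W has shape (k, d), b has length k (assumed form)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + bi)
            for row, bi in zip(W, b)]


def linear_attention_rnn(queries, keys, values, W, b, eps=1e-6):
    """Causal linear attention with a constant-size recurrent state."""
    k, d_v = len(W), len(values[0])
    S = [[0.0] * d_v for _ in range(k)]  # running sum of phi(key_j) * value_j^T
    z = [0.0] * k                        # running sum of phi(key_j)
    outputs = []
    for q, key, v in zip(queries, keys, values):
        phi_k = relu_feature_map(key, W, b)
        for a in range(k):               # state update: O(k * d_v) per token
            z[a] += phi_k[a]
            for c in range(d_v):
                S[a][c] += phi_k[a] * v[c]
        phi_q = relu_feature_map(q, W, b)
        denom = sum(pq * za for pq, za in zip(phi_q, z)) + eps
        outputs.append([sum(phi_q[a] * S[a][c] for a in range(k)) / denom
                        for c in range(d_v)])
    return outputs


def linear_attention_reference(queries, keys, values, W, b, eps=1e-6):
    """Same computation written as explicit attention over all previous tokens."""
    d_v = len(values[0])
    outputs = []
    for i, q in enumerate(queries):
        phi_q = relu_feature_map(q, W, b)
        weights = [sum(a * c for a, c in zip(phi_q, relu_feature_map(keys[j], W, b)))
                   for j in range(i + 1)]
        denom = sum(weights) + eps
        outputs.append([sum(weights[j] * values[j][c] for j in range(i + 1)) / denom
                        for c in range(d_v)])
    return outputs


# Tiny random demo (all sizes are arbitrary).
rng = random.Random(0)
d, k, d_v, n = 4, 3, 2, 5
W = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(k)]
b = [rng.uniform(-1, 1) for _ in range(k)]
queries = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
keys = [[rng.uniform(-1, 1) for _ in range(d)] for _ in range(n)]
values = [[rng.uniform(-1, 1) for _ in range(d_v)] for _ in range(n)]

out_rnn = linear_attention_rnn(queries, keys, values, W, b)
out_ref = linear_attention_reference(queries, keys, values, W, b)
```

The per-token cost and memory of the recurrent version do not depend on how many tokens were already generated, which is where the inference savings come from.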
➕ Pros:
+ The finetuning process requires much less GPU time than training the recurrent variants from scratch
+ Converting a large off-the-shelf transformer to a lightweight inference model w/o repeating the whole training procedure is very handy in many downstream applications.
📝 arxiv.org/abs/2103.13076
Ever wanted to run dense pose recognition on animals 🐶 but didn't have labels?
Facebook AI Research
Now you can train on animals w/o annotations using teacher-student training and utilize datasets with labeled Humans to generalize to animals.
Facebook AI Research recently released the source code (as a part of detectron2) for our paper, along with pretrained teacher and student models for the Chimps dataset. Huge shout-out to Vasil Khalidov for this release!
We (yes, I'm the first author) introduced the DensePose Evolution framework, which can be used to bootstrap DensePose on unlabeled data with animals.
⚙️ DensePose Evolution source code and models
🌀DensePose Evolution proj page
The method explained below 👇
DensePose Evolution Models & Bootstrapping Pipeline
🔬 The training proceeds in two stages (see image below):
1. First, a master model is trained on data from the source domain (humans with full DensePose annotation S, I, U and V) and supporting domain (animals with segmentation annotation only). Only selected animal classes are chosen from the supporting domain through category filters to guarantee the quality of target domain results. The training is done in a class-agnostic manner: all selected categories are mapped to a single category (human).
2. Second, a student model is trained on data from source and supporting domains, as well as data from target domain obtained by applying the master model, selecting high-confidence detections, and sampling the results.
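The second stage can be sketched roughly as follows. This is a simplified illustration, not the detectron2 implementation: the detection dictionaries, the `confidence` field, the threshold, and the function names are all hypothetical. It shows only the core idea of keeping high-confidence master (teacher) predictions on the target domain, sampling from them, and mixing them into the student's training set.

```python
import random


def select_pseudo_labels(detections, min_confidence=0.8, max_samples=2, seed=0):
    """Keep confident master-model detections, then sample at most max_samples."""
    confident = [d for d in detections if d["confidence"] >= min_confidence]
    if len(confident) <= max_samples:
        return confident
    return random.Random(seed).sample(confident, max_samples)


def build_student_training_set(source, supporting, target_detections, **kwargs):
    """Student data = source + supporting + selected pseudo-labels on the target."""
    return source + supporting + select_pseudo_labels(target_detections, **kwargs)


# Hypothetical master-model predictions on unlabeled chimpanzee images.
master_predictions = [
    {"image": "chimp_001.jpg", "confidence": 0.95},
    {"image": "chimp_002.jpg", "confidence": 0.40},
    {"image": "chimp_003.jpg", "confidence": 0.85},
    {"image": "chimp_004.jpg", "confidence": 0.90},
]
student_data = build_student_training_set(
    source=[{"image": "human_001.jpg", "confidence": 1.0}],
    supporting=[{"image": "dog_001.jpg", "confidence": 1.0}],
    target_detections=master_predictions,
    min_confidence=0.8,
    max_samples=2,
)
```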
⚙️ What is included in the GitHub repository:
1. Models that perform estimation of confidence in regressed UV coordinates as well as confidences associated with coarse and fine segmentation.
2. Master and student models trained using the bootstrapping pipeline with chimpanzee as the target category.
3. The source code for the entire pipeline.
🦧 Model Zoo
👨🎓 For a more exhaustive explanation of this method please check my older post.
Can Vision Transformers Learn without Natural Images? YES!🔥
This is very exciting. It was shown that we can pretrain Vision Transformers purely on a synthetic fractal dataset w/o any manual annotations and achieve performance on downstream tasks similar to self-supervised pretraining on ImageNet and to supervised pretraining on other datasets like Places.
The authors also pretrained regular ResNets on their synthetic fractal data. It works pretty well too, although DeiT Transformers are better.
Overall, this is good news. If we can come up with clever approaches to synthetic data generation, then we can generate arbitrarily large datasets for free!
📖 Paper
🌐 Proj page
📦 Fractal dataset is described in this paper.
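For intuition on how labeled images can be generated "for free", here is a small sketch that renders fractals with a random iterated function system (the "chaos game") and uses the index of the parameter set as the class label. This is my own toy version under assumptions: the parameter ranges, image size, and function names are invented for illustration and do not reproduce the actual dataset generation code.

```python
import random


def sample_ifs(num_maps, rng):
    """Random contractive affine maps (x, y) -> (a*x + b*y + e, c*x + d*y + f)."""
    maps = []
    for _ in range(num_maps):
        # Entries in [-0.45, 0.45] keep the Frobenius norm below 1 (contraction).
        a, b, c, d = (rng.uniform(-0.45, 0.45) for _ in range(4))
        e, f = rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)
        maps.append((a, b, c, d, e, f))
    return maps


def render_fractal(maps, rng, num_points=2000, size=32, burn_in=20):
    """Run the chaos game and rasterize the visited points into a binary image."""
    x, y = 0.0, 0.0
    points = []
    for step in range(num_points + burn_in):
        a, b, c, d, e, f = rng.choice(maps)
        x, y = a * x + b * y + e, c * x + d * y + f
        if step >= burn_in:
            points.append((x, y))
    xs, ys = [p[0] for p in points], [p[1] for p in points]
    min_x, min_y = min(xs), min(ys)
    span_x = (max(xs) - min_x) or 1.0
    span_y = (max(ys) - min_y) or 1.0
    image = [[0] * size for _ in range(size)]
    for px, py in points:
        col = min(size - 1, int((px - min_x) / span_x * size))
        row = min(size - 1, int((py - min_y) / span_y * size))
        image[row][col] = 1
    return image


def make_fractal_dataset(num_classes, images_per_class, seed=0):
    """Each class is one IFS parameter set; the label is just its index."""
    rng = random.Random(seed)
    dataset = []
    for label in range(num_classes):
        maps = sample_ifs(num_maps=3, rng=rng)
        for _ in range(images_per_class):
            dataset.append((render_fractal(maps, rng), label))
    return dataset
```

The labels come from the generating formula itself, so no human annotation is involved at any point.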
More results and the visualized attention maps. Models pretrained on the Fractal dataset tend to focus on object edges.
Positional Encodings and Positional Embeddings for Self-Attention Explained
Vanilla Transformers are permutation-invariant models (strictly speaking, permutation-equivariant). By default, permuting the words in the input sentence only permutes the outputs in the same way; the model itself has no notion of order. But this is really bad for language modeling and for image recognition, as sentences and images have a specific structure, and the order of words and pixels does change the semantic meaning.
Consequently, for successful learning, there is a need to incorporate the order of the words/pixels in the input sequence into our self-attention model. This can be done by explicitly attaching the information about the order to every element in a sequence before feeding it to the model. The most widely used approaches are Precomputed Sinusoidal Positional Encodings and Learnable Positional Embeddings.
🟡 In the case of Sinusoidal Positional Encodings, position i is encoded by a series of K sine-cosine pairs (sin(w_k i), cos(w_k i)) with decreasing frequencies w_k, k = 1, ..., K.
🟢 In the case of Positional Embeddings, for every possible position i we randomly initialize a learnable d-dimensional embedding p_i and add it to the input element at that position (concatenation is also possible, but addition is the common choice).
To learn more details about Positional Encodings and Embeddings and how to implement them, refer to the following blogposts:
📜 Positional Encodings
📃 Positional Embeddings
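Both options fit in a few lines of plain Python. Treat this as a sketch under assumptions: the frequency schedule w_k = 10000^(-k/K) follows the convention of the original Transformer paper rather than anything stated in this post, the initialization scale is arbitrary, and the sketch adds the positional embedding to the token vector (swapping in concatenation would be a one-line change). Function names are made up for the example.

```python
import math
import random


def sinusoidal_encoding(position, num_pairs, base=10000.0):
    """K (sin, cos) pairs with decreasing frequencies w_k = base^(-k / K)."""
    encoding = []
    for k in range(num_pairs):
        w_k = base ** (-k / num_pairs)
        encoding.extend([math.sin(w_k * position), math.cos(w_k * position)])
    return encoding


def init_positional_embeddings(max_len, dim, rng, scale=0.02):
    """One randomly initialized d-dimensional vector p_i per possible position.

    In a real model these values would be trained together with the network.
    """
    return [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(max_len)]


def add_positions(token_embeddings, position_table):
    """Attach order information by adding p_i to the i-th input element."""
    return [[t + p for t, p in zip(token, position_table[i])]
            for i, token in enumerate(token_embeddings)]


# Usage: a 3-token sequence with 8-dimensional embeddings.
rng = random.Random(0)
tokens = [[rng.uniform(-1, 1) for _ in range(8)] for _ in range(3)]

fixed_table = [sinusoidal_encoding(i, num_pairs=4) for i in range(3)]
with_sinusoidal = add_positions(tokens, fixed_table)

learned_table = init_positional_embeddings(max_len=16, dim=8, rng=rng)
with_learned = add_positions(tokens, learned_table)
```

The sinusoidal table needs no parameters and is defined for any position, while the learned table only covers positions up to its fixed maximum length.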
PlenOctrees For Real-time Rendering of Neural Radiance Fields
And yet another speed-up of NeRF. Exactly the same idea as in FastNeRF and NeX (predict spherical harmonics coefficients k) - incredible! It's the first time I've seen so many concurrent papers sharing the same idea. But this one has code at least, which makes it the best!
📝 Paper arxiv.org/abs/2103.14024
🌐Project page alexyu.net/plenoctrees/
🛠Code github.com/sxyu/volrend
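To show what "predicting spherical harmonics coefficients k" buys you, here is a tiny sketch of turning stored coefficients into a view-dependent color without running a network at render time. It is my own illustration under assumptions: only degrees 0 and 1 of the real spherical harmonics basis are used, sign conventions for the basis differ between libraries, the sigmoid on the weighted sum is my reading of the paper, and the function names are hypothetical.

```python
import math

# Real spherical harmonics constants for degrees 0 and 1.
C0 = 0.5 * math.sqrt(1.0 / math.pi)
C1 = math.sqrt(3.0 / (4.0 * math.pi))


def sh_basis_deg1(direction):
    """Evaluate the 4 basis functions (degree <= 1) for a viewing direction."""
    x, y, z = direction
    norm = math.sqrt(x * x + y * y + z * z)
    x, y, z = x / norm, y / norm, z / norm
    return [C0, C1 * y, C1 * z, C1 * x]


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def view_dependent_color(sh_coeffs, direction):
    """sh_coeffs holds one list of 4 coefficients per color channel (R, G, B)."""
    basis = sh_basis_deg1(direction)
    return [sigmoid(sum(k * Y for k, Y in zip(channel, basis)))
            for channel in sh_coeffs]
```

Since the coefficients are stored per voxel, changing the camera only requires re-evaluating this cheap closed-form expression.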
Most of the Recent Advancements in Transformers are Useless😱
Google Research
Google study shows Transformer Modifications Fail To Transfer Across Implementations and Applications.
The researchers began by reimplementing and evaluating a variety of transformer variants on the tasks where they are most commonly applied. As a baseline, they used the original transformer model with two modifications: applying layer normalization before the self-attention and feed-forward blocks instead of after, and using relative attention with shared biases instead of sinusoidal positional embeddings.
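The first baseline modification (layer normalization before instead of after each block) is easy to show in code. The sketch below is a toy illustration, not the paper's implementation: the layer norm has no learnable scale or bias, vectors are plain Python lists, and `sublayer` stands in for either self-attention or the feed-forward block.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def post_ln_block(x, sublayer):
    """Original Transformer: normalize after the residual addition."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])


def pre_ln_block(x, sublayer):
    """Baseline in the study: normalize the sublayer input, keep the residual path raw."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]


def toy_sublayer(v):
    """Stand-in for attention or feed-forward; doubles its input."""
    return [2.0 * t for t in v]


x = [1.0, 2.0, 3.0, 4.0]
post_out = post_ln_block(x, toy_sublayer)
pre_out = pre_ln_block(x, toy_sublayer)
```

In the pre-LN variant the residual stream is never normalized, so the input passes through the block unchanged apart from the added sublayer output.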
👀 Surprise!
Most architecture modifications they looked at do not meaningfully improve performance on downstream NLP tasks - they fail to transfer across implementations and applications. See the table below👇 with results for transfer learning based on T5, and supervised machine translation on the WMT'14 English-German benchmark.
😅 Simple ideas are always the best, and more compute never hurts!
Modifications that did improve performance are either (1) relatively simple (e.g. a change in activation function), or (2) rely on an increase in parameter count or FLOPs (e.g. the Switch Transformer or Universal Transformer). And this makes total sense to me.
My take on the reasons for such results is that researchers are often under pressure to publish new papers every year. This spurs cherry-picking of results, overstated claims, and spurious architectural modifications. The performance increase shown in many papers is just a result of overfitting to a specific benchmark or of more careful hyperparameter selection compared to the previous work. And this phenomenon is inherent not only to transformer and NLP papers but to other subfields of Deep Learning research as well.
📝 Arxiv paper
Thanks @ai_newz for the pointer!
It's Sunday! So, for your attention, here is Sparky, a robodog from Australia🇦🇺.
Looks like he is a decent competitor for Spot from Boston Dynamics.