It's Sunday! So here, for your attention, is Sparky, a robodog from Australia.
Looks like he is a decent competitor for Spot from Boston Dynamics.
Swin Transformer: New SOTA backbone for Computer Vision
MS Research Asia
What?
A new vision Transformer architecture, the Swin Transformer, that can serve as a backbone for computer vision instead of CNNs.
Why?
There are two main problems with using Transformers for computer vision:
1. Existing Transformer-based models operate on tokens of a fixed scale. However, in contrast to word tokens, visual elements vary in scale (e.g., objects of different sizes in a scene).
2. Regular (global) self-attention requires a number of operations quadratic in the image size, which limits applications where high resolution is necessary (e.g., instance segmentation). A rough cost comparison is sketched below.
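For intuition, here is a back-of-the-envelope comparison of the two costs in Python (the feature-map and window sizes below are illustrative assumptions, not numbers from the paper):

```python
# Rough op counts for an h x w feature map with C channels:
# global self-attention scales roughly as (h*w)^2 * C, while attention inside
# non-overlapping M x M windows scales as M^2 * (h*w) * C (linear in tokens).

def global_attn_ops(h, w, c):
    return (h * w) ** 2 * c

def window_attn_ops(h, w, c, m):
    return m ** 2 * h * w * c

h, w, c, m = 56, 56, 96, 7   # illustrative sizes, roughly a Swin-T first stage
print(f"global:   {global_attn_ops(h, w, c):.2e}")     # ~9.4e+08
print(f"windowed: {window_attn_ops(h, w, c, m):.2e}")  # ~1.5e+07, about 64x cheaper here
```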
The main ideas of the Swin Transformer:
1. Hierarchical feature maps, where at each level of the hierarchy self-attention is applied within local non-overlapping windows. As patches are merged with network depth, each window covers a progressively larger image region (inspired by CNNs). This enables building architectures similar to feature pyramid networks (FPN) or U-Net for dense pixel-level tasks.
2. Window-based self-attention reduces the computational overhead.
The overall architecture consists of repeating the following blocks:
- Split the RGB image into non-overlapping patches (tokens).
- Apply a linear embedding (MLP) to project the raw patch features to an arbitrary dimension.
- Apply 2 consecutive Swin Transformer blocks with window self-attention: both blocks use the same window size, but the second block shifts its windows by `window_size/2`, which allows information to flow between otherwise non-overlapping windows.
- Downsampling layer: reduce the number of tokens by merging neighboring patches in a 2x2 window and double the feature depth. (A sketch of these pieces follows after this list.)
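A minimal sketch of the key pieces, window partitioning with a cyclic shift and 2x2 patch merging, assuming PyTorch and illustrative tensor shapes (not the authors' implementation):

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    # -> (num_windows * B, M*M, C); self-attention is applied inside each window
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

def shift_windows(x, M):
    """Cyclically shift the map by M//2 so the next block's windows straddle
    the previous block's window borders (lets information flow across windows)."""
    return torch.roll(x, shifts=(-(M // 2), -(M // 2)), dims=(1, 2))

def patch_merging(x, proj):
    """Merge each 2x2 group of neighboring tokens and project 4C -> 2C features."""
    B, H, W, C = x.shape
    x = torch.cat([x[:, 0::2, 0::2], x[:, 1::2, 0::2],
                   x[:, 0::2, 1::2], x[:, 1::2, 1::2]], dim=-1)  # (B, H/2, W/2, 4C)
    return proj(x)                                               # (B, H/2, W/2, 2C)

# Illustrative shapes only
x = torch.randn(1, 56, 56, 96)
regular = window_partition(x, M=7)                        # (64, 49, 96)
shifted = window_partition(shift_windows(x, M=7), M=7)    # windows over the shifted map
merged = patch_merging(x, torch.nn.Linear(4 * 96, 2 * 96))  # (1, 28, 28, 192)
```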
Results
+ Outperforms the previous SOTA by a significant margin on COCO detection and instance segmentation and on ADE20K semantic segmentation.
+ Comparable accuracy to the EfficientNet family on ImageNet-1K classification, while being faster.
Conclusion
While Transformers are super flexible, researchers are starting to inject into them inductive biases similar to those of CNNs, e.g., local connectivity and feature hierarchies. And this seems to help tremendously!
Paper
Code (promised soon)
TL;DR blogpost
Boston Dynamics unveiled a new robot for working in warehouses!
Watch Stretch, their new case-handling robot, move, groove, and unload trucks.
Forwarded from Self Supervised Boy
Interactive Weak Supervision paper from ICLR 2021.
In contrast to classical active learning, where experts are queried to assess individual samples, the idea of this paper is to have experts assess automatically generated labeling heuristics. The authors argue that since experts are good at writing such heuristics from scratch, they should also be able to label auto-generated ones. To rank heuristics that have not yet been assessed, the authors propose to train an ensemble of models that predicts the assessor's mark for a heuristic. As input, these models use a fingerprint of the heuristic: its concatenated predictions on some subset of the data. (A toy sketch of this loop follows below.)
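A toy sketch of that loop, with made-up threshold heuristics and a generic ensemble; the acquisition rule, the simulated expert, and all names here are illustrative assumptions, not the paper's exact procedure:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Toy pool of auto-generated labeling heuristics (labeling functions):
# each maps feature vectors to a label in {-1, +1}.
def make_threshold_lf(feature_idx, thr):
    return lambda X: np.where(X[:, feature_idx] > thr, 1, -1)

X_subset = rng.normal(size=(200, 10))                 # unlabeled subset used for fingerprints
lfs = [make_threshold_lf(i % 10, t) for i, t in enumerate(rng.normal(size=50))]

# Fingerprint of a heuristic = its concatenated predictions on the subset.
fingerprints = np.stack([lf(X_subset) for lf in lfs])  # shape: (num_lfs, 200)

expert_marks = {}                                      # lf index -> 0/1 ("is this heuristic useful?")
model = RandomForestClassifier(n_estimators=50, random_state=0)

for _ in range(10):                                    # interactive rounds
    labels = list(expert_marks.values())
    if len(set(labels)) > 1:                           # need both classes to fit the ensemble
        model.fit(fingerprints[list(expert_marks)], labels)
        p_useful = model.predict_proba(fingerprints)[:, 1]
    else:
        p_useful = rng.random(len(lfs))                # cold start: explore randomly
    # Show the expert the most promising heuristic that has not been assessed yet.
    candidates = [i for i in range(len(lfs)) if i not in expert_marks]
    query = max(candidates, key=lambda i: p_useful[i])
    expert_marks[query] = int(rng.random() > 0.5)      # stand-in for the expert's verdict
```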
There are no very fancy results, there are some concerns raised by reviewers, and there is some strange notation in this paper. Yet the idea looks interesting to me.
With a bit deeper description (and one unanswered question) here.
Source (and rebuttal comments with important links) there.
CvT: Introducing Convolutions to Vision Transformers
Another improvement for vision Transformers! The idea is to inject the inductive biases of CNNs (i.e., shift, scale, and distortion invariance) into the ViT architecture while maintaining the flexibility of Transformers.
How?
Main architectural novelties:
- Hierarchical architecture
- New convolutional token embedding
- Convolutional projections before self-attention, instead of the linear projections used in ViT. This is where convolutions come into play (a rough sketch follows below).
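A rough sketch of such a convolutional projection, assuming PyTorch and a depthwise-separable convolution applied over the 2D token map; the shapes and hyperparameters below are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class ConvProjection(nn.Module):
    """Project a (B, H*W, C) token sequence to Q, K, V with depthwise-separable
    convolutions over the 2D token map instead of plain linear layers."""
    def __init__(self, dim, kernel_size=3, kv_stride=1):
        super().__init__()
        def dw_sep(stride):
            return nn.Sequential(
                nn.Conv2d(dim, dim, kernel_size, stride=stride,
                          padding=kernel_size // 2, groups=dim, bias=False),  # depthwise
                nn.BatchNorm2d(dim),
                nn.Conv2d(dim, dim, kernel_size=1),                            # pointwise
            )
        self.q_proj = dw_sep(1)
        self.k_proj = dw_sep(kv_stride)   # stride > 1 subsamples keys/values
        self.v_proj = dw_sep(kv_stride)

    def forward(self, tokens, H, W):
        B, N, C = tokens.shape            # N == H * W
        x = tokens.transpose(1, 2).reshape(B, C, H, W)
        flat = lambda t: t.flatten(2).transpose(1, 2)   # back to (B, N', C)
        return flat(self.q_proj(x)), flat(self.k_proj(x)), flat(self.v_proj(x))

# Illustrative usage
tokens = torch.randn(2, 14 * 14, 384)
q, k, v = ConvProjection(384, kv_stride=2)(tokens, 14, 14)
print(q.shape, k.shape, v.shape)   # (2, 196, 384) (2, 49, 384) (2, 49, 384)
```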
Results:
Almost SOTA on ImageNet-1K and ImageNet-22K: 83.3% and 87.7% top-1 accuracy.
Almost, because Swin Transformers, with their local window self-attention and downsampling layers, are a bit stronger (see the image with results) and perhaps faster.
Looks like it is a trend now to incorporate useful structural properties of CNNs into Transformers. I'm pretty sure we will see more papers like this in the next few months.
Paper: arxiv.org/abs/2103.15808
New aggregator of trending papers
It uses the number of tweets as a paper hotness score.
You can also create your own reading lists there, add notes about papers and follow other users.
https://42papers.com/
Here is my profile https://42papers.com/u/gradient-dude-933
StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
Adobe Research
The authors leverage Contrastive Language-Image Pretraining (CLIP) models to guide image editing with text queries.
1. Take a pretrained CLIP, a pretrained StyleGAN, and a pretrained ArcFace network for face recognition.
2. Project an input image into the StyleGAN latent space to get a latent code w_s ∈ W+.
3. Now, given the source latent code w_s and a directive in natural language (a text prompt t), iteratively minimize the sum of three losses by changing the latent code w:
a) Distance between the image generated by StyleGAN from w and the text query in CLIP space;
b) Regularization loss penalizing large deviation of w from the source vector w_s;
c) Identity loss, which makes sure that the identity of the generated face is the same as the original one. This is done by minimizing the distance between the images in the embedding space of the ArcFace face recognition network.
Such an image editing process requires iterative optimization of the latent code w (usually 200-300 iterations), which takes several minutes. To make it faster, the authors propose a feed-forward method where, instead of optimization, another neural network predicts residuals that are added to the latent code w to produce the desired image alteration. (A rough sketch of the optimization-based variant follows below.)
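A rough sketch of the optimization-based variant, assuming PyTorch and the open-source `clip` package; `stylegan`, `arcface`, `clip_preprocess_tensor`, `source_image`, and the loss weights are hypothetical placeholders, not the authors' exact code:

```python
import torch
import clip  # OpenAI's CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)
text = clip.tokenize(["a person with blond hair"]).to(device)

# `stylegan`, `arcface`, `clip_preprocess_tensor`, and `source_image` are
# hypothetical placeholders for the pretrained, frozen models and the input image.
w_s = stylegan.encode(source_image)              # source latent code in W+
w = w_s.clone().requires_grad_(True)
opt = torch.optim.Adam([w], lr=0.1)
lambda_l2, lambda_id = 0.01, 0.005               # illustrative loss weights

for step in range(300):                          # ~200-300 iterations per the post
    img = stylegan.synthesize(w)                 # image generated from the current latent
    img_for_clip = clip_preprocess_tensor(img)   # resize/normalize for CLIP (placeholder)
    # a) CLIP loss: distance between the generated image and the text query
    img_feat = clip_model.encode_image(img_for_clip)
    txt_feat = clip_model.encode_text(text)
    loss_clip = 1 - torch.cosine_similarity(img_feat, txt_feat).mean()
    # b) L2 regularization: penalize large deviation from the source vector w_s
    loss_l2 = ((w - w_s) ** 2).mean()
    # c) identity loss in the ArcFace embedding space
    loss_id = 1 - torch.cosine_similarity(arcface(img), arcface(source_image)).mean()
    loss = loss_clip + lambda_l2 * loss_l2 + lambda_id * loss_id
    opt.zero_grad()
    loss.backward()
    opt.step()
```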
The overall idea of the paper is not super novel and has been around for some time already. But this is the first formal paper on such an image editing approach using CLIP and StyleGAN.
Other related papers, recently discussed in this channel:
- Paint by Word
- Using latent space regression to analyze and leverage compositionality in GANs
StyleCLIP Paper
StyleCLIP code
HaloNet: Scaling Local Self-Attention for Parameter Efficient Visual Backbones
Google Research
A novel computer vision backbone: HaloNet. Yes, yet another one!
The authors develop a new family of parameter-efficient local self-attention models, HaloNets, that outperform EfficientNet in the parameter-accuracy tradeoff on ImageNet.
HaloNets show strong results on ImageNet-1k, and promising improvements (up to 4.4x inference speedups) over strong baselines when pretrained on ImageNet-21k with comparable settings.
The ideas are similar to Swin Transformers and CvT: local self-attention, attention-based downsampling layers, a mix of regular convolutions, and self-attention blocks.
In their previous work, the authors used pixel-centered windows, similar to convolutions. Here, they develop a block-centered formulation for better efficiency on matrix accelerators (GPU, TPU). They also introduce an attention-based downsampling layer. (A rough sketch of the blocked local attention follows below.)
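A minimal sketch of blocked local attention with a halo, assuming PyTorch; block and halo sizes are illustrative, and the gather uses simple unfold/fold ops rather than the authors' optimized implementation:

```python
import torch
import torch.nn.functional as F

def halo_attention(x, block=4, halo=1):
    """Each non-overlapping `block x block` query block attends to the
    surrounding (block + 2*halo)^2 neighborhood of keys/values."""
    B, C, H, W = x.shape
    # Queries: non-overlapping blocks -> (B, num_blocks, block*block, C)
    q = F.unfold(x, kernel_size=block, stride=block)
    q = q.view(B, C, block * block, -1).permute(0, 3, 2, 1)
    # Keys/values: the same blocks enlarged by a halo on every side
    k = F.unfold(F.pad(x, (halo,) * 4), kernel_size=block + 2 * halo, stride=block)
    k = k.view(B, C, (block + 2 * halo) ** 2, -1).permute(0, 3, 2, 1)
    attn = torch.softmax(q @ k.transpose(-1, -2) / C ** 0.5, dim=-1)
    out = attn @ k                       # reuse k as values for brevity
    # Fold the per-block outputs back into a (B, C, H, W) map
    out = out.permute(0, 3, 2, 1).reshape(B, C * block * block, -1)
    return F.fold(out, output_size=(H, W), kernel_size=block, stride=block)

x = torch.randn(1, 32, 16, 16)           # illustrative feature map
print(halo_attention(x).shape)           # torch.Size([1, 32, 16, 16])
```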
When applied to detection and instance segmentation, the proposed local self-attention improves on top of strong convolutional baselines. Interestingly, local self-attention with 14x14 receptive fields performs nearly as well as with 35x35.
Paper
No code yet!