StyleCLIP: Text-Driven Manipulation of StyleGAN Imagery
Adobe Research
The authors use Contrastive Language-Image Pretraining (CLIP) models to steer image editing with text queries.
1. Take a pretrained CLIP, a pretrained StyleGAN, and a pretrained ArcFace network for face recognition.
2. Project the input image into the StyleGAN latent space to get a latent vector w_s.
3. Given the source latent code w_s ∈ W+ and a directive in natural language (a text prompt t), iteratively minimize the sum of three losses by changing the latent code w:
a) The distance between the image generated by StyleGAN and the text query in the CLIP embedding space;
b) A regularization loss penalizing large deviations of w from the source vector w_s;
c) An identity loss, which makes sure that the identity of the generated face is the same as the original one. This is done by minimizing the distance between the images in the embedding space of the ArcFace face recognition network.
Such an image editing process requires iterative optimization of the latent code w (usually 200-300 iterations), which takes several minutes. To make it faster, the authors propose a feed-forward method where, instead of optimization, another neural network predicts the residuals that are added to the latent code w to produce the desired image alterations.
The overall idea of the paper is not super novel and has been around for some time already. But this is the first formal paper on such an image editing approach using CLIP and StyleGAN.
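To make the three-term objective more concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the latent code is a short list of floats, `toy_clip_distance` and `toy_identity_distance` are hypothetical stand-ins for the real CLIP and ArcFace distances, the weights `lambda_l2` and `lambda_id` are made up, and gradients are taken numerically instead of with autograd. It is not the authors' implementation.

```python
from typing import Callable, List

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def numerical_grad(f: Callable[[Vector], float], w: Vector, eps: float = 1e-5) -> Vector:
    """Central finite-difference gradient (stand-in for autograd)."""
    grad = []
    for i in range(len(w)):
        up, down = list(w), list(w)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def optimize_latent(
    w_s: Vector,
    clip_distance: Callable[[Vector], float],
    identity_distance: Callable[[Vector, Vector], float],
    lambda_l2: float = 0.1,   # hypothetical weight of the regularization term
    lambda_id: float = 0.1,   # hypothetical weight of the identity term
    steps: int = 200,
    lr: float = 0.1,
) -> Vector:
    """Gradient descent on w, starting from the source code w_s."""

    def objective(w: Vector) -> float:
        return (
            clip_distance(w)                        # (a) match the text query
            + lambda_l2 * sq_dist(w, w_s)           # (b) stay close to w_s
            + lambda_id * identity_distance(w, w_s) # (c) keep the identity
        )

    w = list(w_s)
    for _ in range(steps):
        grad = numerical_grad(objective, w)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Toy stand-ins: the "text query" is a fixed target point in latent space,
# and "identity" is pretended to live in the first coordinate only.
TEXT_TARGET = [1.0, -1.0, 0.5]


def toy_clip_distance(w: Vector) -> float:
    return sq_dist(w, TEXT_TARGET)


def toy_identity_distance(w: Vector, w_s: Vector) -> float:
    return (w[0] - w_s[0]) ** 2


w_s = [0.0, 0.0, 0.0]
w_edit = optimize_latent(w_s, toy_clip_distance, toy_identity_distance)
print(w_edit)
```

The edited code moves towards the text target but stops short of it: the regularization and identity terms pull it back towards w_s, and the coordinate that carries the "identity" moves the least.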
Other related papers, recently discussed in this channel:
▪️ Paint by Word
▪️ Using latent space regression to analyze and leverage compositionality in GANs
StyleCLIP Paper
StyleCLIP code
HaloNet: Scaling Local Self-Attention for Parameter Efficient Visual Backbones
Google Research
Novel computer vision backbone - HaloNet. Yes, yet another one! 🤷‍♂️
Authors develop a new family of parameter-efficient local self-attention models, HaloNets, that outperform EfficientNet in the parameter-accuracy tradeoff on ImageNet.
HaloNets show strong results on ImageNet-1k, and promising improvements (up to 4.4x inference speedups) over strong baselines when pretrained on ImageNet-21k with comparable settings.
The ideas are similar to Swin Transformers and CvT: local self-attention, attention-based downsampling layers, a mix of regular convolutions, and self-attention blocks.
In their previous work, the authors used pixel-centered windows, similar to convolutions. Here, they develop a block-centered formulation for better efficiency on matrix accelerators (GPU, TPU). They also introduce an attention downsampling layer.
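A rough sketch of the block-centered idea, assuming a block size `block` and a halo width `halo` (these names, the zero padding, and the list-of-lists feature map are assumptions for illustration, not taken from the paper's code). The feature map is split into non-overlapping query blocks, and each block attends to its own pixels plus a halo of neighbors around it. Only the gathering of neighborhoods is shown, not the attention itself.

```python
from typing import List, Tuple

Grid = List[List[float]]


def haloed_blocks(
    feature_map: Grid, block: int, halo: int, pad_value: float = 0.0
) -> List[Tuple[Grid, Grid]]:
    """Return (query_block, key_value_neighborhood) pairs.

    Each query block is block x block; its neighborhood is
    (block + 2 * halo) x (block + 2 * halo), padded at the borders.
    """
    height, width = len(feature_map), len(feature_map[0])
    if height % block or width % block:
        raise ValueError("feature map must be divisible by the block size")

    # Pad the map by `halo` on every side so border blocks get a full window.
    empty_row = [pad_value] * (width + 2 * halo)
    padded = [list(empty_row) for _ in range(halo)]
    padded += [[pad_value] * halo + list(row) + [pad_value] * halo for row in feature_map]
    padded += [list(empty_row) for _ in range(halo)]

    window = block + 2 * halo
    pairs = []
    for top in range(0, height, block):
        for left in range(0, width, block):
            queries = [row[left:left + block] for row in feature_map[top:top + block]]
            # In padded coordinates the window starts exactly at (top, left).
            neighborhood = [row[left:left + window] for row in padded[top:top + window]]
            pairs.append((queries, neighborhood))
    return pairs


feature_map = [[r * 4 + c + 1 for c in range(4)] for r in range(4)]
blocks = haloed_blocks(feature_map, block=2, halo=1)
print(len(blocks), "blocks; first neighborhood:", blocks[0][1])
```

All queries in a block share the same neighborhood, so the keys and values can be gathered once per block rather than once per pixel, which is what makes this layout friendlier to GPUs and TPUs than pixel-centered windows.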
When applied to detection and instance segmentation, the proposed local self-attention improves on top of strong convolutional baselines. Interestingly, local self-attention with 14x14 receptive fields performs nearly as well as with 35x35.
Paper
No code yet!
Bored? Here is yet another Fast&Furious backbone for you!
New day - new SOTA on ImageNet 🤯
Forwarded from Data Science by ODS.ai 🦜
EfficientNetV2: Smaller Models and Faster Training
A new paper from Google Brain with a new SOTA architecture called EfficientNetV2. The authors develop a new family of CNN models that are optimized both for accuracy and training speed. The main improvements are:
- an improved training-aware neural architecture search with new building blocks and ideas to jointly optimize training speed and parameter efficiency;
- a new approach to progressive learning that adjusts regularization along with the image size.
As a result, the new models reach SOTA results while training up to 11x faster and being up to 6.8x smaller.
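The progressive learning idea can be sketched as a simple schedule: early stages use small images with weak regularization, later stages use larger images with stronger regularization. The linear interpolation, the function name `progressive_schedule`, the use of dropout as the only regularizer, and all numbers below are assumptions for illustration, not the paper's settings.

```python
from typing import Dict, List, Tuple


def progressive_schedule(
    num_stages: int,
    size_range: Tuple[int, int],
    dropout_range: Tuple[float, float],
) -> List[Dict[str, float]]:
    """Linearly grow image size and dropout rate from the first to the last stage."""
    if num_stages < 2:
        raise ValueError("need at least two stages to interpolate")
    size_start, size_end = size_range
    drop_start, drop_end = dropout_range
    schedule = []
    for stage in range(num_stages):
        t = stage / (num_stages - 1)  # 0.0 at the first stage, 1.0 at the last
        schedule.append(
            {
                "stage": stage,
                "image_size": int(round(size_start + (size_end - size_start) * t)),
                "dropout": drop_start + (drop_end - drop_start) * t,
            }
        )
    return schedule


# Hypothetical values: 4 stages, images from 128 to 300 px, dropout from 0.1 to 0.3.
schedule = progressive_schedule(4, (128, 300), (0.1, 0.3))
for entry in schedule:
    print(entry)
```

In a training loop, each stage would resize the input pipeline to `image_size` and set the regularization strength before continuing from the previous stage's weights.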
Paper: https://arxiv.org/abs/2104.00298
Code will be available here:
https://github.com/google/automl/efficientnetv2
A detailed unofficial overview of the paper: https://andlukyane.com/blog/paper-review-effnetv2
#cv #sota #nas #deeplearning
VR Mind Control from NextMind: decode the act of focusing
Sorry Elon, no need to drill skulls anymore!
The NextMind sensor is non-invasive and reads electrical signals from the brain's visual cortex using small electrodes attached to the skin. Machine learning is then used to decode the brain activity and pinpoint the object of focus, allowing you to control game actions with your mind in real time.
The sensor itself is surprisingly small and light: it fits in the palm of your hand, with two arms that extend slightly beyond that. It easily fits under a baseball cap. You just need to ensure that the nine sets of two-pronged electrode sensors make contact with your skin.
Currently, it's just a dev kit that can be paired with third-party VR headsets, including Oculus. The kit retails for $399 and can already be preordered. The functionality is limited, but it is only the first step. I'm very excited to see the further development of this technology!
Full review is here.
Thanks @ai_newz for the pointer.
Self-supervised Learning for Medical Images
Due to fixed imaging procedures, medical images like X-ray or CT scans are usually well aligned geometrically.
This gives an opportunity to utilize such an alignment to automatically mine similar pairs of image patches for self-supervised training.
The basic idea is to fix K random locations in the unlabeled medical images (K locations are the same for every image) and crop image patches across different images (which correspond to scans of different patients).
Now we create a surrogate classification task by assigning a unique pseudo-label to every location 1...K.
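A minimal sketch of how such a surrogate dataset could be built, assuming images are plain 2D lists of equal size. The helper names (`sample_locations`, `crop`, `build_surrogate_dataset`), the patch size, and the toy images are hypothetical and only illustrate the idea: the K locations are sampled once, and the index of a location is the pseudo-label of every patch cropped there.

```python
import random
from typing import List, Tuple

Image = List[List[int]]
Location = Tuple[int, int]


def sample_locations(k: int, image_size: Tuple[int, int], patch_size: int, seed: int = 0) -> List[Location]:
    """Sample K top-left corners once; they are reused for every image."""
    rng = random.Random(seed)
    height, width = image_size
    return [
        (rng.randrange(height - patch_size + 1), rng.randrange(width - patch_size + 1))
        for _ in range(k)
    ]


def crop(image: Image, top: int, left: int, patch_size: int) -> Image:
    """Cut a square patch out of a 2D image."""
    return [row[left:left + patch_size] for row in image[top:top + patch_size]]


def build_surrogate_dataset(images: List[Image], locations: List[Location], patch_size: int):
    """Return (patch, pseudo_label) pairs; the label is the location index."""
    dataset = []
    for image in images:
        for label, (top, left) in enumerate(locations):
            dataset.append((crop(image, top, left, patch_size), label))
    return dataset


# Two toy 8x8 "scans" of different patients (values are just position codes).
image_a = [[r * 8 + c for c in range(8)] for r in range(8)]
image_b = [[100 + r * 8 + c for c in range(8)] for r in range(8)]

locations = sample_locations(k=3, image_size=(8, 8), patch_size=4)
dataset = build_surrogate_dataset([image_a, image_b], locations, patch_size=4)
print([label for _, label in dataset])
```

Patches with the same label come from the same anatomical position in different patients, which is exactly what makes them "similar pairs" when the scans are well aligned.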
Authors combine the surrogate classification task with image restoration using a denoising autoencoder: they randomly perturb the cropped patches (color jittering, random noise, random cut-outs) and train a decoder to restore the original view.
However, sometimes the alignment between medical images is not perfect by default, and images may depict different body parts. To make sure that the images are aligned, the authors train an autoencoder on full images (before cropping) and select only similar images by comparing the distances between them in the learned autoencoder latent space.
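The selection step can be sketched as a nearest-neighbor search in latent space. The latent vectors below are made up, and picking the `top_n` closest images to a reference is only one hypothetical selection rule; the post does not say which rule or distance the authors use, so Euclidean distance is an assumption too.

```python
import math
from typing import List


def select_similar(latents: List[List[float]], reference_index: int, top_n: int) -> List[int]:
    """Return indices of the top_n images closest to the reference in latent space."""
    reference = latents[reference_index]
    distances = [
        (math.dist(reference, latent), index)
        for index, latent in enumerate(latents)
        if index != reference_index
    ]
    distances.sort()  # closest first
    return [index for _, index in distances[:top_n]]


# Hypothetical autoencoder codes of five full images.
latents = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [0.0, 0.2], [9.0, 9.0]]
similar = select_similar(latents, reference_index=0, top_n=2)
print(similar)
```

Images 2 and 4 are far away from the reference in this toy latent space, so they would be treated as showing a different body part and left out of the patch mining.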
Authors show that their method is significantly better than other self-supervised learning approaches on medical data and can even be combined with existing self-supervised methods like RotNet (predicting image rotations). But unfortunately, the comparison is rather limited, and they didn't compare to Jigsaw Puzzle, SwAV, or recent contrastive self-supervised methods like MoCo, BYOL, and SimCLR.
Paper
Code & Models
#paper_tldr #cv #self_supervised
LatentCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions
A framework that learns meaningful directions in GANs' latent space using unsupervised contrastive learning. Instead of discovering fixed directions as in previous work, this method can discover non-linear directions in pretrained StyleGAN2 and BigGAN models. The discovered directions may be used for image manipulation.
Authors use the differences caused by an edit operation on the feature activations to optimize the identifiability of each direction. The edit operations are modeled by several separate neural nets Δ_i(z), which are learned. Given a latent code z and its generated image x = G(z), we seek to find edit operations Δ_i(z) such that the image x' = G(Δ_i(z)) has semantically meaningful changes over x while still preserving the identity of x.
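One way to read "optimizing the identifiability of each direction" is as a contrastive loss over feature differences: differences produced by the same edit on different latent codes should look alike, and differences produced by different edits should not. The sketch below uses a generic NT-Xent-style loss with cosine similarity; the exact loss form, the temperature, and the toy 2D "feature differences" are assumptions, not the paper's implementation.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def direction_contrastive_loss(diffs: List[List[Vector]], temperature: float = 0.5) -> float:
    """diffs[i][j] is the feature difference caused by edit i on latent code j.

    Positives: same edit, different latent code.
    Negatives: any sample produced by a different edit.
    """
    samples = [(edit, diff) for edit, per_edit in enumerate(diffs) for diff in per_edit]
    total, count = 0.0, 0
    for a, (edit_a, diff_a) in enumerate(samples):
        negatives = sum(
            math.exp(cosine(diff_a, diff_c) / temperature)
            for edit_c, diff_c in samples
            if edit_c != edit_a
        )
        for b, (edit_b, diff_b) in enumerate(samples):
            if a == b or edit_a != edit_b:
                continue
            positive = math.exp(cosine(diff_a, diff_b) / temperature)
            total += -math.log(positive / (positive + negatives))
            count += 1
    return total / count


# Two edits, two latent codes each (toy 2D feature differences).
separated = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]  # each edit is consistent
entangled = [[[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]]]  # edits are mixed up

loss_separated = direction_contrastive_loss(separated)
loss_entangled = direction_contrastive_loss(entangled)
print(loss_separated, loss_entangled)
```

The loss is lower when every edit changes the features in its own consistent way, which is the behavior one would want from an identifiable direction.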
Paper
Code (next week)
#paper_tldr #cv #gan