Gradient Dude
2.54K subscribers
180 photos
50 videos
2 files
169 links
TL;DR for DL/CV/ML/AI papers from an author of publications at top-tier AI conferences (CVPR, NIPS, ICCV,ECCV).

Most ML feeds go for fluff, we go for the real meat.

YouTube: youtube.com/c/gradientdude
IG instagram.com/gradientdude
Download Telegram
​​EfficientNetV2: Smaller Models and Faster Training

A new paper from Google Brain with a new SOTA architecture called EfficientNetV2. The authors develop a new family of CNN models that are optimized both for accuracy and training speed. The main improvements are:

- an improved training-aware neural architecture search with new building blocks and ideas to jointly optimize training speed and parameter efficiency;
- a new approach to progressive learning that adjusts regularization along with the image size;

As a result, the new approach can reach SOTA results while training faster (up to 11x) and smaller (up to 6.8x).

Paper: https://arxiv.org/abs/2104.00298

Code will be available here:
https://github.com/google/automl/efficientnetv2

A detailed unofficial overview of the paper: https://andlukyane.com/blog/paper-review-effnetv2

#cv #sota #nas #deeplearning
This media is not supported in your browser
VIEW IN TELEGRAM
VR Mind Control from NextMind: decode the act of focusing

Sorry Elon, no need to drill skulls anymore!

The NextMind sensor is non-invasive and can read electrical signals from the brains' visual cortex using small electrodes attached to the skin. Then the machine learning is used to decode brain activity and pinpoint the object of focus, allowing you to control game actions with your mind in real-time.

The sensor itself is surprisingly small and light β€” it fits in the palm of your hand, with two arms that extend slightly beyond that. It easily fits under a baseball cap. You just need to ensure that the nine sets of two-pronged electrode sensors make contact with your skin

Currently, it's just a dev-kit that can be paired with 3rd party VR headsets including Oculus. The kit retails for $399 and can be already preordered. The functional is limited, but it is only the first step, I'm very excited to see the further development of this technology!

Full review is here.
Thanks @ai_newz for the pointer.
​​Self-supervised Learning for Medical images

Due to fixed imaging procedures, medical images like X-ray or CT scans are usually well aligned geometrically.
This gives an opportunity to utilize such an alignment to automatically mine similar pairs of image patches for self-supervised training.

The basic idea is to fix K random locations in the unlabeled medical images (K locations are the same for every image) and crop image patches across different images (which correspond to scans of different patients).
Now we create a surrogate classification task by assigning a unique pseudo-label to every location 1...K.
Authors combine the surrogate classification task with image restoration using a denoising autoencoder: they randomly perturb the cropped patches (color jittering, random noise, random cut-outs) and train a decoder to restore the original view.

However, sometimes the alignment between medical images is not perfect by default and images may depict different body parts. To make sure that the images are aligned, we train an autoencoder on full images (before cropping) and select only similar images by comparing the distances between them in the learned autoencoder latent space.

Authors show that their method is significantly better than other self-supervised learning approaches on medical data and can even be combined with existing self-supervised methods like RotNet (predicting image rotations). But unfortunately, the comparison is rather limited, and they didn't compare to Jigsaw Puzzle, SwaV, or recent contrastive self-supervised methods like MoCO, BYOL, and SimCLR.

πŸ“ Paper
πŸ›  Code & Models

#paper_tldr #cv #self_supervised
Results. TransVW (transferable visual words) is the proposed method.
LatentCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions

A framework that learns meaningful directions in GANs' latent space using unsupervised contrastive learning. Instead of discovering fixed directions such as in previous work, this method can discover non-linear directions in pretrained StyleGAN2 and BigGAN models. The discovered directions may be used for image manipulation.

Authors use the differences caused by an edit operation on the feature activations to optimize the identifiability of each direction. The edit operations are modeled by several separate neural nets βˆ†_i(z) and learning. Given a latent code z and its generated image x = G(z), we seek to find edit operations βˆ†_i(z) such that the image x' = G(βˆ†_i(z)) has semantically meaningful changes over x while still preserving the identity of x.


πŸ“ Paper
πŸ›  Code (next week)

#paper_tldr #cv #gan
Media is too big
VIEW IN TELEGRAM
Spectacular Image Stylization using CLIP and DALL-E

As a Style Transfer Dude, I can say that this is super cool. A statue of David by Michelangelo was used as an input image. Then it was morphed towards different styles of famous artists by steering the latent code towards the embeddings of a textual description in CLIP space.

I especially like Picasso's Cubism where it created a half-bull half-human portrait which is one of the typical sujets of Picasso. Rene Magritte's stylization is my second favorite.

I discussed similar techniques for image editing here and here.

πŸ“™Colab which contains the most significant parts to reproduce the results: link.

Original youtube video.
Thanks @NeuralShit for the pointer.

#image_gen #gan #style_transfer
Media is too big
VIEW IN TELEGRAM
Joker Donald Trump Inauguration SpeechπŸƒ

Look Ma, DeepFakes are getting amazingly good! No need to spend thousands of dollars anymore to create such realistic effects.

Borrowed from @NeuroLands
ReStyle: A Residual-Based StyleGAN Encoder via Iterative RefinementπŸ”₯

This
paper proposed an improved way to project real images in the StyleGAN latent space (which is required for further image manipulations).

Instead of directly predicting the latent code of a given real image using a single pass, the encoder is tasked with predicting a residual with respect to the current estimate. The initial estimate is set to just average latent code across the dataset. Inverting is done using multiple of forward passes by iteratively feeding the encoder with the output of the previous step along with the original input.

Notably, during inference, ReStyle converges its inversion after a small number of steps (e.g., < 5), taking less than 0.5 seconds per image. This is compared to several minutes per image when inverting using optimization techniques.

The results are impressive! The L2 and LPIPS loss valeus are comparable to optimization-based techniques, while two orders of magnitude faster!

πŸ“ Paper
πŸ›  Code
πŸ‘« Colab
Media is too big
VIEW IN TELEGRAM
Monkey is playing Pong just using the power of its mind (no joystick)πŸ”₯

New demo from
Neuralink. A monkey called Pager is playing video games using brain signals for in-game manipulations.
I'm just curious how much more precise is invasive neuralink versus some non-invasive electroencephalography-based sensors?

Now imagine someone with paralysis using a smartphone/computer with their mind. This will be invaluable. I'm not even saying about controlling bionic arms and legs.
Forwarded from Neural Shit
Forwarded from Self Supervised Boy
Self-supervision paper from arxiv for histopathology CV.

Authors draw inspiration from the process of how histopathologists tend to review the images, and how those images are stored. Histopathology images are multiscale slices of enormous size (tens of thousands pixels by one side), and area experts constantly move through different levels of magnification to keep in mind both fine and coarse structures of the tissue.

Therefore, in this paper the loss is proposed to capture relation between different magnification levels. Authors propose to train network to order concentric patches by their magnification level. They organise it as the classification task β€” network to predict id of the order permutation instead of predicting order itself.

Also, authors proposed specific architecture for this task and appended self-training procedure, as it was shown to boost results even after pre-training.

All this allows them to reach quality increase even in high-data regime.

My description of the architecture and loss expanded here.
Source of the work here.
​​DetCon: The Self-supervised Contrastive Detection MethodπŸ₯½
DeepMind

A new self-supervised objective, contrastive detection, which tasks representations with identifying object-level features across augmentations.

Object-based regions are identified with an approximate, automatic segmentation algorithm based on pixel affinity (bottom). These masks are carried through two stochastic data augmentations and a convolutional feature extractor, creating groups of feature vectors in each view (middle). The contrastive detection objective then pulls together pooled feature vectors from the same mask (across views) and pushes apart features from different masks and different images (top).

🌟Highlights
+ SOTA detection and Instance Segmentation (on COCO) and Semantic Segmentation results (on PASCAL) when pretrained in self-supervised regime on ImageNet, while requiring up to 5Γ— fewer epochs than SimCLR.
+ It also outperforms supervised pretraining on Imagenet.
+ DetCon(SimCLR) converges much faster to reach SOTA: 200 epochs are sufficient to surpass supervised transfer to COCO, and 500 to PASCAL.
+ Linear increase in the number of model parameters (using ResNet-101, ResNet-152, and ResNet-200) brings a linear increase in the accuracy on downstream tasks.
+ Despite only being trained on ImageNet, DetCon(BYOL) matches the performance of Facebook's SEER model that used a higher capacity RegNet architecture and was pretrained on 1 Billion Instagram images.
+ First time a ResNet-50 with self-supervised pretraining on COCO outperforms the supervised pretraining for Transfer to PASCAL
+ The power of DetCon strongly correlates with the quality of the masks. The better the masks used during the self-supervised pretraining stage, the better the accuracy on downstream tasks.

βš™οΈ Method details
DetConS and DetConB, based on two recent self-supervised baselines: SimCLR and BYOL respectively with ResNet-50 backbone.
Authors adopt the data augmentation procedure and network architecture from these methods while applying the proposed Contrastive Detection loss to each.

Each image is randomly augmented twice, resulting in two images: x, x'.
In addition, they compute for each image a set of masks that segment the image into different components.
These masks can be computed using efficient, off-the-shelf, unsupervised segmentation algorithms. In particular, authors use Felzenszwalb-Huttenlocher algorithm a classic segmentation procedure that iteratively merges regions using pixel-based affinity. This algorithm does not require any training and is available in scikit-image. If available, human-annotated segmentations can also be used instead of automatically generated. Each mask (represented as a binary image) is transformed using the same cropping and resizing as used for the underlying RGB image, resulting in two sets of masks {m}, {m'} which are aligned with the augmented images x, x'.

For every mask m associated with the image, authors compute a mask-pooled hidden vector (i.e., similar to regular average pooling but applied only to spatial locations belonging to the same mask).
Then 2-layer MLP is used as a projection on top of the mask-pooled hidden vectors. Note that if you replace masked-pooling with a single global average pooling then you will get exactly SimCLR or BYOL architecture.

Standard contrastive loss based on cross-entropy is used for learning. Positive pair is the latent representations of the same mask from augmented views x and x'. Latent representations of different masks from the same image and from different images in the batch are used as negative samples. Moreover, negative masks are allowed to overlap with a positive one.
🦾 Main experiments

Pretrain on Imagenet -> finetune on COCO or PASCAL:
1. Pretrain on Imagenet in a self-supervised regime using the proposed DetCon approach.
2. Use the self-supervised pretraining of the backbone to initialize Mask-RCNN and fine-tune it with GT labels for 12 epochs on COCO or 45 epochs on PASCAL (Semantic Segmentation).
3. Achieve SOTA results while using 5x fewer pretraining epochs than SimCLR.

Pretrain on COCO -> finetune on PASCAL for Semantic Segmentation task:
1. Pretrain on COCO in self-supervised regime using the proposed DetCon approach.
2. Use the self-supervised pretraining of the backbone to initialize Mask-RCNN and fine-tune it with GT labels for 45 epochs on PASCAL (Semantic Segmentation).
3. Achieve SOTA results while using 4x fewer pretraining epochs than SimCLR.
5. The first time a self-supervised pretrained ResNet-50 backbone outperforms supervised pretraining on COCO.

πŸ“ Paper: Efficient Visual Pretraining with Contrastive Detection