Gradient Dude

ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement🔥

This paper proposes an improved way to project real images into the StyleGAN latent space (which is required for further image manipulation).

Instead of directly predicting the latent code of a given real image in a single pass, the encoder is tasked with predicting a residual with respect to the current estimate. The initial estimate is simply the average latent code across the dataset. Inversion is done using multiple forward passes, iteratively feeding the encoder the output of the previous step along with the original input.
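The refinement loop described above can be sketched roughly as follows. This is a minimal toy illustration, not the authors' implementation: `restyle_invert`, `toy_generator` and `toy_encoder` are hypothetical names, and the toy functions are simple linear stand-ins for the real StyleGAN generator and the trained encoder.

```python
def restyle_invert(x, encoder, generator, w_avg, n_steps=5):
    """Iteratively refine a latent code, starting from the average latent."""
    w = list(w_avg)              # initial estimate: the average latent code
    y = generator(w)             # current reconstruction
    for _ in range(n_steps):
        delta = encoder(x, y)    # residual w.r.t. the current estimate
        w = [wi + di for wi, di in zip(w, delta)]
        y = generator(w)         # fed back to the encoder on the next step
    return w, y


# Toy stand-ins (assumptions for illustration only).
def toy_generator(w):
    # Pretend "image" is just the latent scaled by 2.
    return [2.0 * wi for wi in w]


def toy_encoder(x, y):
    # Sees the target and the current reconstruction, returns a latent residual
    # that covers half of the remaining gap.
    return [0.25 * (xi - yi) for xi, yi in zip(x, y)]


target = [2.0, -4.0]
w, y = restyle_invert(target, toy_encoder, toy_generator, w_avg=[0.0, 0.0])
print(w, y)
```

With these toy functions the reconstruction error halves on every step, which mirrors the idea that a few cheap forward passes are enough to get close to the target.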

Notably, during inference, ReStyle converges its inversion after a small number of steps (e.g., < 5), taking less than 0.5 seconds per image. This is compared to several minutes per image when inverting using optimization techniques.

The results are impressive! The L2 and LPIPS loss values are comparable to those of optimization-based techniques, while inference is two orders of magnitude faster!

📝 Paper
🛠 Code
👫 Colab

🌀 Project page with more results

A monkey is playing Pong using only the power of its mind (no joystick)🔥

New demo from Neuralink. A monkey called Pager is playing video games, using brain signals for in-game control.
I'm curious how much more precise the invasive Neuralink implant is compared to non-invasive electroencephalography-based sensors.

Now imagine someone with paralysis using a smartphone or computer with their mind. This will be invaluable. Not to mention controlling bionic arms and legs.

Forwarded from Self Supervised Boy

Self-supervision paper from arxiv for histopathology CV.

The authors draw inspiration from how histopathologists review images and how those images are stored. Histopathology images are multiscale slices of enormous size (tens of thousands of pixels per side), and domain experts constantly move between magnification levels to keep both the fine and coarse structures of the tissue in mind.

Accordingly, the paper proposes a loss that captures the relation between different magnification levels. The authors train the network to order concentric patches by magnification level. They frame this as a classification task: the network predicts the id of the order permutation rather than the order itself.
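To make the "permutation id as a class label" idea concrete, here is a small sketch. It assumes every permutation of the magnification levels gets its own class, which may differ from the subset the paper actually uses; `build_permutation_table` and `make_order_sample` are hypothetical helper names.

```python
import itertools
import random


def build_permutation_table(n_levels):
    """Enumerate all orderings; the index in this list is the class id."""
    return list(itertools.permutations(range(n_levels)))


def make_order_sample(patches, perm_table, rng):
    """Shuffle concentric patches and return them with the permutation id.

    `patches` is ordered from lowest to highest magnification.
    """
    label = rng.randrange(len(perm_table))
    perm = perm_table[label]
    shuffled = [patches[i] for i in perm]
    return shuffled, label


table = build_permutation_table(3)           # 3 magnification levels
patches = ["patch@10x", "patch@20x", "patch@40x"]
shuffled, label = make_order_sample(patches, table, random.Random(0))
print(shuffled, label)
```

The network then only needs a softmax over `len(table)` classes, and the original order can be recovered by looking up `table[label]`.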

The authors also propose a dedicated architecture for this task and add a self-training procedure, which was shown to boost results even after pre-training.

Together, this yields a quality improvement even in the high-data regime.

My expanded description of the architecture and loss is here.
The source of the work is here.

Self-supervised driven consistency training for annotation efficient histopathology image analysis | Notion

In this paper, the authors gain insight for the new loss from the way histopathologists work with images. Because of the enormous scale of histopathology images, they are stored in a pyramid-like structure with different zoom levels, so researchers tend…

DetCon: The Self-supervised Contrastive Detection Method🥽
DeepMind

A new self-supervised objective, contrastive detection, which tasks representations with identifying object-level features across augmentations.

Object-based regions are identified with an approximate, automatic segmentation algorithm based on pixel affinity (bottom). These masks are carried through two stochastic data augmentations and a convolutional feature extractor, creating groups of feature vectors in each view (middle). The contrastive detection objective then pulls together pooled feature vectors from the same mask (across views) and pushes apart features from different masks and different images (top).

🌟Highlights
+ SOTA detection and instance segmentation (on COCO) and semantic segmentation (on PASCAL) results when pretrained in a self-supervised regime on ImageNet, while requiring up to 5× fewer epochs than SimCLR.
+ It also outperforms supervised pretraining on ImageNet.
+ DetCon(SimCLR) converges much faster to reach SOTA: 200 epochs are sufficient to surpass supervised transfer to COCO, and 500 to PASCAL.
+ Linear increase in the number of model parameters (using ResNet-101, ResNet-152, and ResNet-200) brings a linear increase in the accuracy on downstream tasks.
+ Despite only being trained on ImageNet, DetCon(BYOL) matches the performance of Facebook's SEER model, which used a higher-capacity RegNet architecture and was pretrained on 1 billion Instagram images.
+ For the first time, a ResNet-50 with self-supervised pretraining on COCO outperforms supervised pretraining for transfer to PASCAL.
+ The power of DetCon strongly correlates with the quality of the masks. The better the masks used during the self-supervised pretraining stage, the better the accuracy on downstream tasks.

⚙️ Method details
There are two variants, DetConS and DetConB, based on two recent self-supervised baselines, SimCLR and BYOL respectively, both with a ResNet-50 backbone.
The authors adopt the data augmentation procedure and network architecture from these methods while applying the proposed contrastive detection loss to each.

Each image is randomly augmented twice, resulting in two images: x, x'.
In addition, they compute for each image a set of masks that segment the image into different components.
These masks can be computed using efficient, off-the-shelf, unsupervised segmentation algorithms. In particular, the authors use the Felzenszwalb-Huttenlocher algorithm, a classic segmentation procedure that iteratively merges regions using pixel-based affinity. This algorithm does not require any training and is available in scikit-image. If available, human-annotated segmentations can be used instead of automatically generated ones. Each mask (represented as a binary image) is transformed using the same cropping and resizing as the underlying RGB image, resulting in two sets of masks {m}, {m'} that are aligned with the augmented images x, x'.
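A rough sketch of the alignment step: the same crop box and output size are applied to the image and to its masks, so they stay pixel-aligned. For brevity it uses an integer label map (one id per region, as produced for example by `skimage.segmentation.felzenszwalb`) instead of separate binary masks, and nearest-neighbour resizing for both; `crop_and_resize_nearest` is a hypothetical helper, not code from the paper.

```python
def crop_and_resize_nearest(grid, top, left, height, width, out_h, out_w):
    """Crop a 2D grid to a box and resize it with nearest-neighbour sampling."""
    out = []
    for i in range(out_h):
        src_r = top + (i * height) // out_h      # source row for output row i
        row = []
        for j in range(out_w):
            src_c = left + (j * width) // out_w  # source column for output column j
            row.append(grid[src_r][src_c])
        out.append(row)
    return out


# Toy 4x4 "image" (one channel) and its region label map.
image = [[10, 11, 12, 13],
         [20, 21, 22, 23],
         [30, 31, 32, 33],
         [40, 41, 42, 43]]
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1],
          [2, 2, 3, 3],
          [2, 2, 3, 3]]

box = dict(top=1, left=1, height=2, width=2, out_h=2, out_w=2)
image_view = crop_and_resize_nearest(image, **box)
label_view = crop_and_resize_nearest(labels, **box)   # same box, so still aligned
print(image_view, label_view)
```

Nearest-neighbour sampling keeps the label ids discrete; a real pipeline would typically use a smoother interpolation for the RGB image while keeping nearest-neighbour for the masks.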

For every mask m associated with the image, the authors compute a mask-pooled hidden vector (i.e., regular average pooling, but applied only to the spatial locations belonging to that mask).
Then a 2-layer MLP is used as a projection on top of the mask-pooled hidden vectors. Note that if you replace mask pooling with a single global average pooling, you get exactly the SimCLR or BYOL architecture.
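Mask pooling itself is simple enough to sketch in a few lines. The example below works on nested Python lists rather than tensors and uses a label map with one id per mask; `mask_pool` is a hypothetical name, and the projection MLP is left out.

```python
def mask_pool(features, mask_ids):
    """Average feature vectors over the spatial locations of each mask.

    features: H x W x C nested lists; mask_ids: H x W integer mask ids.
    Returns {mask_id: mean feature vector of length C}.
    """
    sums, counts = {}, {}
    for feat_row, mask_row in zip(features, mask_ids):
        for feat, m in zip(feat_row, mask_row):
            if m not in sums:
                sums[m] = [0.0] * len(feat)
                counts[m] = 0
            sums[m] = [s + f for s, f in zip(sums[m], feat)]
            counts[m] += 1
    return {m: [s / counts[m] for s in sums[m]] for m in sums}


features = [[[1.0, 0.0], [3.0, 0.0]],
            [[0.0, 2.0], [0.0, 4.0]]]
masks = [[0, 0],
         [1, 1]]
print(mask_pool(features, masks))

# A single mask covering the whole map reduces to global average pooling.
print(mask_pool(features, [[0, 0], [0, 0]]))
```

The second call illustrates the remark above: with one all-covering mask there is a single pooled vector per image, which is the SimCLR/BYOL setup.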

A standard contrastive loss based on cross-entropy is used for learning. A positive pair consists of the latent representations of the same mask from the augmented views x and x'. Latent representations of different masks from the same image and from different images in the batch are used as negative samples. Moreover, negative masks are allowed to overlap with the positive one.
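A simplified, SimCLR-style version of this loss might look like the sketch below. It is an illustration under assumptions, not the paper's exact formulation: cosine similarity, the temperature value, the one-directional form (view 1 as anchors, view 2 as candidates) and the name `contrastive_detection_loss` are all choices made here.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def contrastive_detection_loss(view1, view2, temperature=0.1):
    """Cross-entropy over mask-pooled vectors.

    view1, view2: lists of (image_id, mask_id, vector).
    Positive: same (image_id, mask_id) in the other view.
    Negatives: every other mask in view2, from the same or other images.
    """
    losses = []
    for img1, mask1, z1 in view1:
        logits, pos_index = [], None
        for k, (img2, mask2, z2) in enumerate(view2):
            logits.append(cosine(z1, z2) / temperature)
            if (img2, mask2) == (img1, mask1):
                pos_index = k
        if pos_index is None:        # mask was cropped out of the other view
            continue
        top = max(logits)            # log-sum-exp with max subtraction
        log_denom = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denom - logits[pos_index])
    return sum(losses) / len(losses) if losses else 0.0


v1 = [(0, 0, [1.0, 0.0]), (0, 1, [0.0, 1.0]), (1, 0, [1.0, 1.0])]
v2 = [(0, 0, [0.9, 0.1]), (0, 1, [0.1, 0.9]), (1, 0, [1.0, 0.8])]
print(contrastive_detection_loss(v1, v2))
```

The loss is low when each mask's vector is closest to its own counterpart in the other view and grows when a negative is more similar than the positive.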

🦾 Main experiments

Pretrain on ImageNet -> finetune on COCO or PASCAL:
1. Pretrain on ImageNet in a self-supervised regime using the proposed DetCon approach.
2. Use the self-supervised pretraining of the backbone to initialize Mask-RCNN and fine-tune it with GT labels for 12 epochs on COCO or 45 epochs on PASCAL (Semantic Segmentation).
3. Achieve SOTA results while using 5x fewer pretraining epochs than SimCLR.

Pretrain on COCO -> finetune on PASCAL for Semantic Segmentation task:
1. Pretrain on COCO in a self-supervised regime using the proposed DetCon approach.
2. Use the self-supervised pretraining of the backbone to initialize Mask-RCNN and fine-tune it with GT labels for 45 epochs on PASCAL (Semantic Segmentation).
3. Achieve SOTA results while using 4x fewer pretraining epochs than SimCLR.
4. The first time a self-supervised pretrained ResNet-50 backbone outperforms supervised pretraining on COCO.

📝 Paper: Efficient Visual Pretraining with Contrastive Detection

I disappeared for a couple of days, and now I'm happy to announce that yesterday I defended my PhD in Computer Vision!🥳🍾

So more high-quality posts are coming!

Researchers from Berkeley rolled out VideoGPT - a transformer that generates videos.

The results are not super "WOW", but the architecture is quite simple, and it can now serve as a starting point for future work in this direction. As you know, GPT-3 for text generation was not built overnight either. So let's wait for faster methods and better quality.

📝Paper
⚙️Code
🌐Project page
🃏Demo

Infinite image generation and resampling 🔥

This method can generate infinite images of diverse and complex scenes that transition naturally from one into another. It does so without any conditioning and trains without any supervision from a dataset of unrelated square images.

You can check an interactive demo on the project website.

📝Paper

Snap has released a new model for animating the entire human body (not just the face). Looks pretty good.

The principle is similar to their previous method, the First Order Motion Model for animating heads. The differences are that (a) the background motion is modeled explicitly, and (b) instead of regressing local affine transformations for a set of keypoints, this method learns heatmaps of different body parts in an unsupervised way, and the transformation matrix of each body part is computed by applying principal component analysis (PCA) to the predicted heatmaps.
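As a rough illustration of point (b), the sketch below runs PCA on a single 2D heatmap: it computes the weighted mean and covariance of the pixel coordinates and builds a 2x2 matrix whose columns are the principal axes scaled by the square roots of the eigenvalues. Reading that matrix as the body part's affine transform is an assumption made here for illustration; `heatmap_pca` is a hypothetical name, and the exact construction in the paper may differ.

```python
import math


def heatmap_pca(heatmap):
    """Return ((mean_x, mean_y), 2x2 matrix) from a non-negative 2D heatmap."""
    total = sum(sum(row) for row in heatmap)
    if total <= 0:
        raise ValueError("heatmap must contain positive mass")
    mean_x = sum(v * x for row in heatmap for x, v in enumerate(row)) / total
    mean_y = sum(v * y for y, row in enumerate(heatmap) for v in row) / total

    # Weighted covariance of the pixel coordinates.
    cxx = cyy = cxy = 0.0
    for y, row in enumerate(heatmap):
        for x, v in enumerate(row):
            dx, dy = x - mean_x, y - mean_y
            cxx += v * dx * dx
            cyy += v * dy * dy
            cxy += v * dx * dy
    cxx, cyy, cxy = cxx / total, cyy / total, cxy / total

    # Closed-form eigen-decomposition of the symmetric 2x2 covariance.
    half_trace = (cxx + cyy) / 2.0
    radius = math.hypot((cxx - cyy) / 2.0, cxy)
    lam1 = half_trace + radius
    lam2 = max(half_trace - radius, 0.0)
    theta = 0.5 * math.atan2(2.0 * cxy, cxx - cyy)   # angle of the main axis
    c, s = math.cos(theta), math.sin(theta)
    s1, s2 = math.sqrt(lam1), math.sqrt(lam2)
    matrix = [[c * s1, -s * s2],
              [s * s1, c * s2]]
    return (mean_x, mean_y), matrix


# A horizontal "limb": all mass on the middle row.
limb = [[0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 0, 0],
        [0, 0, 0, 0, 0]]
print(heatmap_pca(limb))
```

The mean gives the part's location, while the matrix captures its orientation and extent, which is the extra information a single keypoint does not provide.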

More details are on the project website. Most importantly, the code and pretrained weights are available. So go ahead and animate!

P.S. 2 years ago another method for animating the whole body "Everybody Dance Now" was released, but there you had to retrain the network for each new person.
