Joker Donald Trump Inauguration Speech
Look Ma, DeepFakes are getting amazingly good! No need to spend thousands of dollars anymore to create such realistic effects.
Borrowed from @NeuroLands
ReStyle: A Residual-Based StyleGAN Encoder via Iterative Refinement
This paper proposes an improved way to project real images into the StyleGAN latent space (which is required for further image manipulations).
Instead of directly predicting the latent code of a given real image in a single pass, the encoder is tasked with predicting a residual with respect to the current estimate. The initial estimate is simply the average latent code across the dataset. Inversion is done with multiple forward passes, iteratively feeding the encoder the output of the previous step along with the original input.
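To make the refinement loop concrete, here is a toy sketch in plain Python. It is not the ReStyle code: the one-dimensional latent, the linear `generator` and the hand-written `encoder` are stand-ins assumed purely for illustration (the real encoder is a trained network). Only the loop structure follows the description above: start from the average latent, reconstruct, predict a residual, add it.

```python
# Toy illustration of residual-based iterative inversion (not the ReStyle code).
# Assumptions: a 1-D latent, a linear stand-in generator, and a hand-written
# stand-in encoder that returns a damped correction instead of a learned one.

def generator(w):
    """Hypothetical generator: maps a latent scalar to a 2-pixel 'image'."""
    return [2.0 * w, -1.0 * w]

def encoder(x, y_current):
    """Hypothetical encoder: sees the target x and the current reconstruction
    and predicts a residual for the latent (a damped least-squares step)."""
    err = [xi - yi for xi, yi in zip(x, y_current)]
    # The generator's derivative w.r.t. w is (2, -1); project the error onto it.
    step = (2.0 * err[0] - 1.0 * err[1]) / (2.0 ** 2 + 1.0 ** 2)
    return 0.5 * step  # damping: each pass removes half of the remaining error

def invert(x, w_avg, num_steps=5):
    """Start from the average latent and refine it with predicted residuals."""
    w = w_avg
    history = [w]
    for _ in range(num_steps):
        y = generator(w)       # current reconstruction
        w = w + encoder(x, y)  # add the predicted residual to the estimate
        history.append(w)
    return w, history

w_true = 3.0
x = generator(w_true)
w_hat, history = invert(x, w_avg=0.0, num_steps=5)
errors = [abs(w_true - w) for w in history]
```

In this toy setup the latent error shrinks on every pass, which mirrors the idea of converging in a handful of steps; how fast a trained encoder converges is of course an empirical matter.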
Notably, during inference, ReStyle converges after a small number of steps (e.g., < 5), taking less than 0.5 seconds per image, compared with several minutes per image for inversion using optimization techniques.
The results are impressive! The L2 and LPIPS loss values are comparable to those of optimization-based techniques, while inference is two orders of magnitude faster!
Paper
Code
Colab
Monkey is playing Pong just using the power of its mind (no joystick)
New demo from Neuralink. A monkey called Pager is playing video games using brain signals for in-game manipulations.
I'm just curious how much more precise the invasive Neuralink is compared to non-invasive electroencephalography-based sensors.
Now imagine someone with paralysis using a smartphone/computer with their mind. This will be invaluable. Not to mention controlling bionic arms and legs.
Forwarded from Self Supervised Boy
A self-supervision paper from arXiv for histopathology CV.
The authors draw inspiration from how histopathologists review images and from how those images are stored. Histopathology images are multiscale slices of enormous size (tens of thousands of pixels per side), and domain experts constantly move between magnification levels to keep both the fine and coarse structures of the tissue in mind.
Therefore, the paper proposes a loss that captures the relation between different magnification levels. The authors train the network to order concentric patches by their magnification level. They frame it as a classification task: the network predicts the id of the order permutation instead of predicting the order itself.
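A rough sketch of how such permutation-id labels could be built, in plain Python. The number of patches (3), the magnification names and the helper functions are all hypothetical and chosen only for illustration; the paper's actual sampling scheme may differ.

```python
import itertools
import random

def make_permutation_table(num_patches):
    """Every possible ordering of the patches gets an integer class id."""
    return list(itertools.permutations(range(num_patches)))

def make_training_example(patches_by_magnification, table, rng):
    """patches_by_magnification is sorted from lowest to highest magnification.
    Returns the shuffled patches and the id of the permutation that was applied."""
    label = rng.randrange(len(table))
    order = table[label]
    shuffled = [patches_by_magnification[i] for i in order]
    return shuffled, label

table = make_permutation_table(3)  # 3 patches -> 3! = 6 classes
rng = random.Random(0)
patches = ["patch@10x", "patch@20x", "patch@40x"]  # placeholders for image crops
shuffled, label = make_training_example(patches, table, rng)
```

The network then sees `shuffled` and is trained with ordinary cross-entropy to predict `label`, so recovering the order becomes a plain classification problem over `len(table)` classes.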
The authors also propose a specific architecture for this task and add a self-training procedure, which was shown to boost results even after pre-training.
All this allows them to improve quality even in the high-data regime.
My description of the architecture and loss is expanded here.
Source of the work here.
Ярослав's Notion: Self-supervised driven consistency training for annotation efficient histopathology image analysis
DetCon: The Self-supervised Contrastive Detection Method
DeepMind
A new self-supervised objective, contrastive detection, which tasks representations with identifying object-level features across augmentations.
Object-based regions are identified with an approximate, automatic segmentation algorithm based on pixel affinity (bottom). These masks are carried through two stochastic data augmentations and a convolutional feature extractor, creating groups of feature vectors in each view (middle). The contrastive detection objective then pulls together pooled feature vectors from the same mask (across views) and pushes apart features from different masks and different images (top).
Highlights
+ SOTA detection and instance segmentation (on COCO) and semantic segmentation (on PASCAL) results when pretrained in a self-supervised regime on ImageNet, while requiring up to 5× fewer epochs than SimCLR.
+ It also outperforms supervised pretraining on ImageNet.
+ DetCon(SimCLR) converges much faster to reach SOTA: 200 epochs are sufficient to surpass supervised transfer to COCO, and 500 to PASCAL.
+ Linear increase in the number of model parameters (using ResNet-101, ResNet-152, and ResNet-200) brings a linear increase in the accuracy on downstream tasks.
+ Despite only being trained on ImageNet, DetCon(BYOL) matches the performance of Facebook's SEER model that used a higher capacity RegNet architecture and was pretrained on 1 Billion Instagram images.
+ First time a ResNet-50 with self-supervised pretraining on COCO outperforms the supervised pretraining for Transfer to PASCAL
+ The power of DetCon strongly correlates with the quality of the masks. The better the masks used during the self-supervised pretraining stage, the better the accuracy on downstream tasks.
Method details
There are two variants, DetConS and DetConB, based on two recent self-supervised baselines, SimCLR and BYOL respectively, with a ResNet-50 backbone.
The authors adopt the data augmentation procedure and network architecture from these methods while applying the proposed contrastive detection loss to each.
Each image is randomly augmented twice, resulting in two images: x, x'. In addition, they compute for each image a set of masks that segment the image into different components.
These masks can be computed using efficient, off-the-shelf, unsupervised segmentation algorithms. In particular, the authors use the Felzenszwalb-Huttenlocher algorithm, a classic segmentation procedure that iteratively merges regions using pixel-based affinity. This algorithm does not require any training and is available in scikit-image. If available, human-annotated segmentations can be used instead of automatically generated ones. Each mask (represented as a binary image) is transformed using the same cropping and resizing as the underlying RGB image, resulting in two sets of masks {m}, {m'} which are aligned with the augmented images x, x'.
For every mask m associated with the image, the authors compute a mask-pooled hidden vector (i.e., similar to regular average pooling but applied only to spatial locations belonging to the same mask). Then a 2-layer MLP is used as a projection on top of the mask-pooled hidden vectors. Note that if you replace masked pooling with a single global average pooling, you get exactly the SimCLR or BYOL architecture.
A standard contrastive loss based on cross-entropy is used for learning. A positive pair consists of the latent representations of the same mask from the augmented views x and x'. Latent representations of different masks from the same image and from different images in the batch are used as negative samples. Moreover, negative masks are allowed to overlap with a positive one.
Main experiments
Pretrain on ImageNet -> finetune on COCO or PASCAL:
1. Pretrain on ImageNet in a self-supervised regime using the proposed DetCon approach.
2. Use the self-supervised pretraining of the backbone to initialize Mask-RCNN and fine-tune it with GT labels for 12 epochs on COCO or 45 epochs on PASCAL (Semantic Segmentation).
3. Achieve SOTA results while using 5x fewer pretraining epochs than SimCLR.
Pretrain on COCO -> finetune on PASCAL for Semantic Segmentation task:
1. Pretrain on COCO in self-supervised regime using the proposed DetCon approach.
2. Use the self-supervised pretraining of the backbone to initialize Mask-RCNN and fine-tune it with GT labels for 45 epochs on PASCAL (Semantic Segmentation).
3. Achieve SOTA results while using 4x fewer pretraining epochs than SimCLR.
4. For the first time, a ResNet-50 backbone with self-supervised pretraining on COCO outperforms supervised pretraining.
Paper: Efficient Visual Pretraining with Contrastive Detection
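The mask pooling and contrastive loss from the method details can be sketched in plain Python as below. This is a minimal illustration, not DeepMind's implementation: the tiny 2x2 feature maps, the cosine similarity, the temperature of 0.1 and the function names are assumptions, the 2-layer projection MLP is left out, and the same masks are reused for both views (i.e., no cropping is assumed).

```python
import math

def mask_pool(features, mask):
    """Average the feature vectors at spatial locations where mask == 1.
    features: H x W x C nested lists, mask: H x W nested lists of 0/1."""
    channels = len(features[0][0])
    total = [0.0] * channels
    count = 0
    for feat_row, mask_row in zip(features, mask):
        for vec, m in zip(feat_row, mask_row):
            if m:
                count += 1
                for c in range(channels):
                    total[c] += vec[c]
    if count == 0:
        raise ValueError("mask selects no locations")
    return [t / count for t in total]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """Cross-entropy over similarities, with the positive as the correct class."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    top = max(logits)
    log_sum = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_sum - logits[0]

# Two views of one image (hypothetical 2x2 feature maps with 2 channels).
feats_x = [[[1.0, 0.0], [1.0, 0.2]], [[0.0, 1.0], [0.1, 1.0]]]
feats_x2 = [[[0.9, 0.1], [1.0, 0.0]], [[0.2, 1.0], [0.0, 0.8]]]
mask_a = [[1, 1], [0, 0]]  # top region
mask_b = [[0, 0], [1, 1]]  # bottom region

a_x, a_x2 = mask_pool(feats_x, mask_a), mask_pool(feats_x2, mask_a)
b_x2 = mask_pool(feats_x2, mask_b)

loss_matched = contrastive_loss(a_x, a_x2, [b_x2])     # same mask across views
loss_mismatched = contrastive_loss(a_x, b_x2, [a_x2])  # wrong mask as "positive"
global_pool = mask_pool(feats_x, [[1, 1], [1, 1]])     # all-ones mask
```

With an all-ones mask, `mask_pool` reduces to global average pooling, which is the sense in which the setup falls back to SimCLR or BYOL.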
Researchers from Berkeley rolled out VideoGPT - a transformer that generates videos.
The results are not super "WOW", but the architecture is quite simple and it can now be a starting point for all future work in this direction. As you know, GPT-3 for text generation was not built right away either. So let's wait for faster methods and better quality.
Paper
Code
Project page
Demo
Infinite image generation and resampling
This method can generate infinite images of diverse and complex scenes that transition naturally from one into another. It does so without any conditioning and trains without any supervision from a dataset of unrelated square images.
You can check an interactive demo on the project website.
Paper
Snap has released a new model for animating the entire human body (not just the face). Looks pretty good.
The principle is similar to their previous method, First Order Motion Model, for animation of heads. The difference is that (a) the background motion is explicitly modeled here; and (b) instead of regressing local affine transformations for a set of keypoints, this method learns to find heatmaps of different body parts in an unsupervised way, and the transformation matrix of each body part is computed by applying principal component analysis (PCA) to the predicted heatmaps.
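One plausible reading of the PCA step, sketched in plain Python: treat the heatmap as weights over pixel coordinates, take the weighted mean as the part's position, and use the principal axes of the weighted covariance (scaled by the standard deviation along each axis) as the 2x2 transformation. The function name and the exact scaling are assumptions for illustration, not taken from Snap's code, and the heatmap is assumed to have a non-zero sum.

```python
import math

def heatmap_to_affine(heatmap):
    """heatmap: H x W nested lists of non-negative weights (non-zero sum).
    Returns (mean, A): mean is the (x, y) centre, A is a 2x2 matrix whose
    columns are the principal axes scaled by the std. deviation along them."""
    total = sum(sum(row) for row in heatmap)
    # Weighted mean of pixel coordinates -> position of the body part.
    mx = sum(w * x for row in heatmap for x, w in enumerate(row)) / total
    my = sum(w * y for y, row in enumerate(heatmap) for w in row) / total
    # Weighted 2x2 covariance of pixel coordinates.
    sxx = sxy = syy = 0.0
    for y, row in enumerate(heatmap):
        for x, w in enumerate(row):
            dx, dy = x - mx, y - my
            sxx += w * dx * dx
            sxy += w * dx * dy
            syy += w * dy * dy
    sxx, sxy, syy = sxx / total, sxy / total, syy / total
    # Closed-form eigen-decomposition of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2.0 * sxy, sxx - syy)  # angle of the main axis
    half_trace = 0.5 * (sxx + syy)
    radius = math.hypot(0.5 * (sxx - syy), sxy)
    lam1 = half_trace + radius
    lam2 = max(half_trace - radius, 0.0)
    c, s = math.cos(theta), math.sin(theta)
    A = [[c * math.sqrt(lam1), -s * math.sqrt(lam2)],
         [s * math.sqrt(lam1), c * math.sqrt(lam2)]]
    return (mx, my), A

# A horizontal bar: the main axis should point along x.
bar = [[0, 0, 0, 0, 0],
       [1, 1, 1, 1, 1],
       [0, 0, 0, 0, 0]]
mean, A = heatmap_to_affine(bar)
```

The appeal of this construction is that position, orientation and extent of a part all come from one heatmap, with no separate regression head for the affine parameters.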
More details on the project website. Most importantly, there are code and pretrained weights. So go ahead and animate!
P.S. Two years ago another method for animating the whole body, "Everybody Dance Now", was released, but there you had to retrain the network for each new person.
Moore's law is still working. Yesterday IBM announced that it has created the first 2nm chip!
They claim that their 2nm development will improve performance by 45% at the same power, or cut energy use by 75% at the same performance, compared to modern 7nm processors (e.g., Intel's).
IBM is one of the world's leading research centers on future semiconductor technology, but it sold its manufacturing business to GlobalFoundries in 2014, so currently IBM only develops IP in collaboration with others (Samsung and, as recently announced, Intel) for their manufacturing facilities.
The latest NVIDIA GPUs based on the Ampere microarchitecture (2020) use the TSMC 7 nm fabrication process. TSMC's 3nm is already entering production in 2022. But when is IBM/Intel's 2nm even coming? I'm also curious whether Intel can even manage its 5nm chips by 2024/25.
Source article.
Another cool work from OpenAI: Diffusion Models Beat GANs on Image Synthesis.
New SOTA for image generation on ImageNet
The paper uses a new type of generative model: the diffusion probabilistic model. A diffusion model is a parameterized Markov chain trained using variational inference to generate samples matching the data after finite time. The diffusion process here is a Markov chain that gradually adds noise to the data, in the opposite direction of sampling, until the signal is destroyed. So here we are learning the reverse transitions in this chain, which undo the diffusion process. And of course, we parameterize everything with neural networks.
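The forward (noising) half of that chain is easy to write down, so here is a small sketch in plain Python. It uses the standard Gaussian transition x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * noise; the linear schedule, its endpoints, the 1000 steps and the function names are assumptions for illustration, and the learned reverse process (the neural network part) is not shown.

```python
import math
import random

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise variance for each step (the linear schedule is an assumption here)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def forward_diffusion(x0, betas, rng):
    """Run the forward Markov chain, adding a little Gaussian noise at each step."""
    x = list(x0)
    trajectory = [list(x)]
    for beta in betas:
        x = [math.sqrt(1.0 - beta) * v + math.sqrt(beta) * rng.gauss(0.0, 1.0)
             for v in x]
        trajectory.append(x)
    return trajectory

def signal_fraction(betas):
    """sqrt of the cumulative product of (1 - beta_t): the coefficient
    multiplying the original x0 after each step."""
    out, prod = [], 1.0
    for beta in betas:
        prod *= 1.0 - beta
        out.append(math.sqrt(prod))
    return out

betas = linear_beta_schedule(1000)
rng = random.Random(0)
trajectory = forward_diffusion([1.0, -1.0, 0.5], betas, rng)
remaining = signal_fraction(betas)
```

`remaining` decays towards zero, which is the "until the signal is destroyed" part; generation then runs the chain backwards, one learned denoising step at a time, which is also why sampling is slow.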
It produces very high-quality generations, even better than GANs (this is especially clear in the man with a fish, who is not that spectacular in the BigGAN model). The current disadvantage of diffusion models is slow training and inference.
Paper
Code