I totally need glasses that move with my eyebrows. (c) Yann LeCun
The quality is wicked because of the pesky twitter compression.
CLIP + StyleGAN: searching the StyleGAN latent space using a text description embedded with CLIP.
Queries: "A pony that looks like Beyonce", "... like Billie Eilish", "... like Rihanna"
The basic idea
Generate an image with StyleGAN and pass it to CLIP, computing a loss against the CLIP embedding of the text query. Then backprop through both networks and optimize the latent code in StyleGAN's latent space.
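A minimal sketch of that loop, assuming a pretrained StyleGAN generator and a CLIP model are already available (the load_stylegan_generator / load_clip helpers below are hypothetical placeholders, not part of any specific library):

```python
import torch
import torch.nn.functional as F

# Hypothetical loaders: any pretrained StyleGAN2 generator (latent -> image)
# and CLIP image/text encoders will do.
G = load_stylegan_generator()
clip_model, preprocess, tokenize = load_clip()

# Embed the text query once and normalize it.
with torch.no_grad():
    text_emb = F.normalize(
        clip_model.encode_text(tokenize(["A pony that looks like Beyonce"])), dim=-1
    )

# Start from the average latent and optimize it directly.
w = G.mean_latent().clone().requires_grad_(True)
opt = torch.optim.Adam([w], lr=0.05)

for step in range(300):
    img = G(w)                                            # StyleGAN synthesizes an image
    img_emb = F.normalize(clip_model.encode_image(preprocess(img)), dim=-1)
    loss = 1.0 - (img_emb * text_emb).sum(dim=-1).mean()  # cosine distance to the text embedding
    opt.zero_grad()
    loss.backward()                                       # gradients flow through CLIP and StyleGAN
    opt.step()                                            # only the latent w is updated
```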
Drawbacks: 1) it only works on text CLIP knows; 2) it needs some cherry-picking, only about 1 in 5 results is really good.
Source: tweet.
Cute RoboCat learned how to track objects. A fun application of computer vision. Is anyone among my subscribers working in robotics?
Source: IG @bio.makers
GANs are making their way into production
Adobe has rolled out a super-resolution feature for Photoshop. Now one can upscale an image 2x on each side.
For the curious, here are several links to SOTA super-resolution methods:
1. Structure-Preserving Super Resolution with Gradient Guidance (SPSR), CVPR2020
2. Learned Image Downscaling for Upscaling using Content Adaptive Resampler (CAR), ECCV2020
3. Single Image Super-Resolution via a Holistic Attention Network (HAN), ECCV2020
4. ClassSR: A General Framework to Accelerate Super-Resolution Networks by Data Characteristic, CVPR2021
Let me know in the comments if there is a better super-res paper.
Summary of Recent Generative Models
A nice blog post giving a brief overview of several recent generative models, including VAEs, GANs, and diffusion models.
Read it here
Aran Komatsuzaki
State-of-the-Art Image Generative Models
I have aggregated some of the SotA image generative models released recently, with short summaries, visualizations and comments. The overall development is summarized, and the future trends are spe…
Facebook AI has built TimeSformer, a new architecture for video understanding. It's the first based exclusively on the self-attention mechanism used in Transformers. It outperforms the state of the art while being more efficient than 3D ConvNets for video.
Why it matters
To train video-understanding models, the best 3D CNNs today can only use video segments that are a few seconds long. With TimeSformer, we are able to train on far longer video clips, up to several minutes long. This may dramatically advance research to teach machines to understand complex long-form actions in videos, which is an important step for many AI applications geared toward human behavior understanding (e.g., an AI assistant).
Furthermore, the low inference cost of TimeSformer is an important step toward supporting future real-time video processing applications, such as AR/VR, or intelligent assistants that provide services based on video taken from wearable cameras.
FAIR Blog
Paper
The well-known implementation freak lucidrains has already released TimeSformer code.
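A rough usage sketch of that repo (timesformer-pytorch), based on its README; parameter names may differ slightly between versions:

```python
import torch
from timesformer_pytorch import TimeSformer  # pip install timesformer-pytorch

# Hyperparameters roughly as in the repo's README; adjust for your task.
model = TimeSformer(
    dim=512,
    image_size=224,
    patch_size=16,
    num_frames=8,
    num_classes=10,
    depth=12,
    heads=8,
)

video = torch.randn(2, 8, 3, 224, 224)  # (batch, frames, channels, height, width)
logits = model(video)                   # (2, 10) class logits
```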
Fabulous DeepDream straight outta Mexico. The ambient sounds are nicely matched as well.
Who said that DeepDream is useless?
You don't need EfficientNets. Simple tricks make ResNets better and faster than EfficientNets
Google Brain
The authors introduce a new family of ResNet architectures: ResNet-RS.
Main Results
- ResNet-RSs are 1.7x-2.7x faster than EfficientNets on TPUs, while achieving similar or better accuracies on ImageNet.
- In a semi-supervised learning scenario (with 130M pseudo-labeled images), ResNet-RS achieves 86.2% top-1 ImageNet accuracy while being 4.7x faster than EfficientNet-NoisyStudent.
- SoTA results for transfer learning.
Continued below.
They take advantage of the following ideas (a rough training sketch follows the list):
1. Convolutions are better optimized for GPUs/TPUs than the depthwise convolutions used in EfficientNets.
2. A simple scaling strategy (i.e., increasing model dimensions such as width, depth, and resolution) is the key. Scale model depth in regimes where overfitting can occur:
- Depth scaling outperforms width scaling for longer epoch regimes.
- Width scaling outperforms depth scaling for shorter epoch regimes.
3. Apply weight decay, label smoothing, dropout, and stochastic depth for regularization.
4. Use RandAugment instead of AutoAugment.
5. Add two common and simple architectural changes: Squeeze-and-Excitation and ResNet-D.
6. Decrease weight decay when using more regularization such as dropout, augmentations, stochastic depth, etc.
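Roughly what that recipe looks like in PyTorch. This is a sketch only: build_resnet_rs is a hypothetical constructor standing in for a ResNet with the SE and ResNet-D changes, and the exact hyperparameter values vary with model size and training length in the paper.

```python
import torch
from torch import nn
import torchvision.transforms as T

# RandAugment instead of AutoAugment (available in torchvision >= 0.11).
train_tf = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandAugment(),
    T.ToTensor(),
])

# Label smoothing as one of the regularizers (PyTorch >= 1.10).
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

# Hypothetical constructor: a ResNet with SE blocks, the ResNet-D stem,
# dropout before the classifier, and stochastic depth inside residual blocks.
model = build_resnet_rs(depth=101, dropout=0.25, stochastic_depth=0.1)

# Weight decay is decreased because the other regularizers are already strong.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=4e-5)
```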
How to tune the hyperparameters?
1. Scaling strategies found in small-scale regimes (e.g. on small models or with few training epochs) fail to generalize to larger models or longer training runs.
2. Run a small subset of models across different scales, for the full training epochs, to gain intuition about which dimensions are the most useful across model scales.
3. Increase the image resolution more slowly than previously recommended; larger image resolutions often yield diminishing returns.
FLOPs vs Latency
While FLOPs provide a hardware-agnostic metric for assessing computational demand, they may not be indicative of actual latency times for training and inference. On custom hardware architectures (e.g. TPUs and GPUs), FLOPs are an especially poor proxy because operations are often bounded by memory access costs and have different levels of optimization on modern matrix multiplication units. The inverted bottlenecks used in EfficientNets employ depthwise convolutions with large activations and have a small compute-to-memory ratio (operational intensity) compared to the ResNet's bottleneck blocks, which employ dense convolutions on smaller activations. This makes EfficientNets less efficient on modern accelerators compared to ResNets. A ResNet-RS model with 1.8x more FLOPs than EfficientNet-B6 is 2.7x faster on a TPUv3.
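A back-of-the-envelope sketch of that operational-intensity argument, comparing a dense 3x3 convolution with a depthwise 3x3 convolution at the same spatial size (a simplified model that ignores caching, padding, and operator fusion):

```python
def conv_stats(h, w, c_in, c_out, k=3, depthwise=False, bytes_per_elem=4):
    """Rough FLOPs and FLOPs-per-byte (operational intensity) of one conv layer."""
    if depthwise:
        flops = 2 * h * w * c_in * k * k          # one k x k filter per channel
        params = c_in * k * k
        c_out = c_in
    else:
        flops = 2 * h * w * c_in * c_out * k * k  # dense convolution
        params = c_in * c_out * k * k
    # Memory traffic: read input activations and weights, write output activations.
    bytes_moved = bytes_per_elem * (h * w * c_in + h * w * c_out + params)
    return flops, flops / bytes_moved

print(conv_stats(56, 56, 256, 256))                  # dense: hundreds of FLOPs per byte
print(conv_stats(56, 56, 256, 256, depthwise=True))  # depthwise: only a few FLOPs per byte
```

The dense conv does far more arithmetic per byte moved, so it keeps the matrix units busy; the depthwise conv is memory-bound, which is why its low FLOP count does not translate into low latency on TPUs/GPUs.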
Parameters vs Memory
Although ResNet-RS has 3.8x more parameters and FLOPs than EfficientNet at the same accuracy, the ResNet-RS model requires 2.3x less memory and runs ~3x faster on TPUs and GPUs.
Parameter count does not necessarily dictate memory consumption during training, because memory is often dominated by the size of the activations. EfficientNets have large activations, which cause a larger memory footprint, because they require large image resolutions to match the performance of ResNet-RS. E.g., to get 84% top-1 ImageNet accuracy, EfficientNet needs an input image of 528x528, while ResNet-RS needs only 256x256.
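As a quick sanity check on why resolution dominates, activation size per feature map scales with the squared resolution (a rough calculation that ignores architectural differences):

```python
# Activation elements grow with the square of the input resolution.
ratio = (528 * 528) / (256 * 256)
print(f"~{ratio:.1f}x more activation elements per feature map at 528x528 vs 256x256")  # ~4.3x
```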
Conclusions:
1. You'd better use ResNets as baselines for your projects now.
2. Latency and memory consumption are generally more relevant metrics for comparing architectures than the number of FLOPs; FLOPs and parameter counts are not representative of latency or memory consumption (see the measurement sketch after this list).
3. Training methods can be more task-specific than architectures. E.g., data augmentation is useful for small datasets or when training for many epochs, but the specifics of the augmentation method can be task-dependent (e.g. scale jittering instead of RandAugment is better on Kinetics-400 video classification).
4. The best-performing scaling strategy depends on the training regime and whether overfitting is an issue. When training for 350 epochs on ImageNet, use depth scaling, whereas width scaling is preferable when training for few epochs (e.g. only 10).
5. Future successful architectures will probably emerge through co-design with hardware, particularly in resource-tight regimes like mobile phones.
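A minimal sketch of how to report latency and peak memory on a GPU, as suggested in point 2 (here with a plain torchvision ResNet-50 as a stand-in for whatever model you are comparing):

```python
import time
import torch
from torchvision.models import resnet50

model = resnet50().cuda().eval()
x = torch.randn(8, 3, 224, 224, device="cuda")

with torch.no_grad():
    for _ in range(10):                  # warm-up iterations
        model(x)
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    start = time.time()
    for _ in range(50):
        model(x)
    torch.cuda.synchronize()             # wait for all GPU work before stopping the clock

latency_ms = (time.time() - start) / 50 * 1000
peak_mem_gb = torch.cuda.max_memory_allocated() / 1e9
print(f"{latency_ms:.1f} ms per batch, {peak_mem_gb:.2f} GB peak memory")
```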
My blogpost at gdude.de
Paper: Revisiting ResNets: Improved Training and Scaling Strategies
Code (TensorFlow)
Other references:
EfficientNet
ResNet-D (Bag of Tricks)
RandAugment
AutoAugment
Squeeze-and-Excitation
Google AI Blog
Improving Deep Learning Performance with AutoAugment