Gradient Dude
TL;DR for DL/CV/ML/AI papers from an author of publications at top-tier AI conferences (CVPR, NIPS, ICCV, ECCV).

Most ML feeds go for fluff; we go for the real meat.

YouTube: youtube.com/c/gradientdude
IG: instagram.com/gradientdude
Facebook AI has built TimeSformer, a new architecture for video understanding. It's the first video architecture based exclusively on the self-attention mechanism used in Transformers. It outperforms the state of the art while being more efficient than 3D ConvNets for video.

โ“Why it matters
To train video-understanding models, the best 3D CNNs today can only use video segments that are a few seconds long. With TimeSformer, we are able to train on far longer video clips, up to several minutes long. This may dramatically advance research to teach machines to understand complex long-form actions in videos, which is an important step for many AI applications geared toward human behavior understanding (e.g., an AI assistant).

Furthermore, the low inference cost of TimeSformer is an important step toward supporting future real-time video processing applications, such as AR/VR, or intelligent assistants that provide services based on video taken from wearable cameras.

🌐 FAIR Blog
📝 Paper
The well-known implementation freak lucidrains has already released a ⚙️ TimeSformer implementation.
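
For intuition, here is a minimal PyTorch sketch of the paper's best-performing attention scheme, "divided space-time attention": each patch first attends to the same patch across frames, then to the other patches in its own frame. The module structure, shapes, and names below are illustrative assumptions of mine (norms and MLPs omitted), not the official implementation; see lucidrains' repo for a faithful one.

import torch
import torch.nn as nn

class DividedSpaceTimeAttention(nn.Module):
    """Sketch of TimeSformer-style divided space-time attention."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        # x: (batch, frames, patches, dim)
        b, t, p, d = x.shape

        # 1) Temporal attention: each spatial patch attends across frames.
        xt = x.permute(0, 2, 1, 3).reshape(b * p, t, d)
        xt = self.temporal_attn(xt, xt, xt)[0]
        x = x + xt.reshape(b, p, t, d).permute(0, 2, 1, 3)

        # 2) Spatial attention: each frame's patches attend to each other.
        xs = x.reshape(b * t, p, d)
        xs = self.spatial_attn(xs, xs, xs)[0]
        return x + xs.reshape(b, t, p, d)

attn = DividedSpaceTimeAttention(dim=768)
video_tokens = torch.randn(2, 8, 196, 768)  # 8 frames, 14x14 patches
print(attn(video_tokens).shape)  # torch.Size([2, 8, 196, 768])

The payoff of the factorization: each token attends to T + P other tokens instead of T * P, which is what makes minute-long clips affordable.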
You don't need EfficientNets. Simple tricks make ResNets better and faster than EfficientNets
Google Brain

The authors introduce a new family of ResNet architectures: ResNet-RS.

🔥 Main Results
- ResNet-RSs are 1.7x-2.7x faster than EfficientNets on TPUs, while achieving similar or better accuracy on ImageNet.
- In a semi-supervised learning scenario (with 130M pseudo-labeled images), ResNet-RS achieves 86.2% top-1 ImageNet accuracy while being 4.7x faster than EfficientNet-NoisyStudent.
- SoTA results for transfer learning.

Continued below 👇
🏃 They take advantage of the following ideas:
1. Plain convolutions are better optimized for GPUs/TPUs than the depthwise convolutions used in EfficientNets.
2. A simple scaling strategy (i.e., increasing model dimensions such as width, depth, and resolution) is the key. Scale model depth in regimes where overfitting can occur:
🔸 Depth scaling outperforms width scaling for longer epoch regimes.
🔸 Width scaling outperforms depth scaling for shorter epoch regimes.
3. Apply weight decay, label smoothing, dropout, and stochastic depth for regularization.
4. Use RandAugment instead of AutoAugment.
5. Add two common and simple architectural changes: Squeeze-and-Excitation and ResNet-D (see the SE sketch after this list).
6. Decrease weight decay when using more regularization such as dropout, augmentations, and stochastic depth.
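
Since item 5 comes up so often, here is a minimal PyTorch sketch of a generic Squeeze-and-Excitation block: global-average-pool the feature map ("squeeze"), run it through a small bottleneck MLP ("excitation"), and rescale the channels. The reduction ratio and sizes are common defaults I assume here; the exact ResNet-RS settings may differ.

import torch
import torch.nn as nn

class SqueezeExcite(nn.Module):
    """Generic SE block: learn per-channel attention weights."""
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                       # x: (batch, channels, H, W)
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c))    # per-channel weights in (0, 1)
        return x * w.view(b, c, 1, 1)           # reweight each channel

se = SqueezeExcite(256)
print(se(torch.randn(2, 256, 56, 56)).shape)  # torch.Size([2, 256, 56, 56])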

โ“How to tune the hyperparameters?
1. Scaling strategies found in small-scale regimes (e.g. on small models or with few training epochs) fail to generalize to larger models or longer training iterations
2. Run a small subset of models across different scales, for the full training epochs, to gain intuition on which dimensions are the most useful across model scales.
3. Increase Image Resolution lower than previously recommended. Larger image resolutions often yield diminishing returns.

⚔️ FLOPs vs Latency
While FLOPs provide a hardware-agnostic metric for assessing computational demand, they may not be indicative of actual latency for training and inference. On custom hardware (e.g., TPUs and GPUs), FLOPs are an especially poor proxy because operations are often bounded by memory access costs and have different levels of optimization on modern matrix multiplication units. The inverted bottlenecks used in EfficientNets employ depthwise convolutions with large activations and have a small compute-to-memory ratio (operational intensity) compared to ResNet's bottleneck blocks, which employ dense convolutions on smaller activations. This makes EfficientNets less efficient 😂 on modern accelerators than ResNets. A ResNet-RS model with 1.8x more FLOPs than EfficientNet-B6 is 2.7x faster on a TPUv3.
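
The point is easy to verify first-hand. A rough benchmark sketch (layer sizes are arbitrary choices of mine; absolute numbers depend entirely on your hardware): the depthwise convolution below has roughly 1/256 of the dense convolution's FLOPs, yet its wall-clock advantage is typically far smaller because it is memory-bound.

import time
import torch
import torch.nn as nn

x = torch.randn(32, 256, 56, 56)
dense = nn.Conv2d(256, 256, 3, padding=1)                  # ~256x the FLOPs
depthwise = nn.Conv2d(256, 256, 3, padding=1, groups=256)  # of the depthwise conv

if torch.cuda.is_available():
    x, dense, depthwise = x.cuda(), dense.cuda(), depthwise.cuda()

def bench(layer, n=50):
    with torch.no_grad():
        layer(x)                                           # warm-up
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n):
            layer(x)
        if torch.cuda.is_available():
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / n

print(f"dense:     {bench(dense) * 1e3:.2f} ms")
print(f"depthwise: {bench(depthwise) * 1e3:.2f} ms")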

⚔️ Parameters vs Memory
Although ResNet-RS has 3.8x more parameters and FLOPs than an EfficientNet of the same accuracy, the ResNet-RS model requires 2.3x less memory and runs ~3x faster on TPUs and GPUs.
Parameter count does not necessarily dictate memory consumption during training, because memory is often dominated by the size of the activations. EfficientNets have large activations, which cause a larger memory footprint, because they require large image resolutions to match the performance of ResNet-RS. E.g., to reach 84% top-1 ImageNet accuracy, EfficientNet needs a 528x528 input image, while ResNet-RS needs only 256x256.
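
A toy illustration of the activations-dominate-memory point (layer and batch sizes are arbitrary assumptions of mine; the 256 vs 528 resolutions echo the comparison above): the parameter count stays fixed, while activation memory grows roughly quadratically with resolution.

import torch
import torch.nn as nn

layer = nn.Conv2d(64, 64, 3, padding=1)
n_params = sum(p.numel() for p in layer.parameters())
print(f"parameters: {n_params * 4 / 1e6:.2f} MB in fp32 (fixed)")

for res in (256, 528):
    x = torch.randn(2, 64, res, res)
    y = layer(x)  # activations like this are kept for the backward pass
    print(f"resolution {res}: {y.numel() * 4 / 1e6:.0f} MB of activations in this one layer")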

☑️ Conclusions:
1. You'd better use ResNets as baselines for your projects now.
2. Latency and memory consumption are generally more relevant metrics for comparing architectures than the number of FLOPs; FLOPs and parameters are not representative of either.
3. Training methods can be more task-specific than architectures. E.g., data augmentation is useful for small datasets or when training for many epochs, but the specifics of the augmentation method can be task-dependent (e.g., scale jittering is better than RandAugment on Kinetics-400 video classification).
4. The best-performing scaling strategy depends on the training regime and whether overfitting is an issue. When training for 350 epochs on ImageNet, use depth scaling, whereas width scaling is preferable when training for few epochs (e.g., only 10).
5. Future successful architectures will probably emerge through co-design with hardware, particularly in resource-tight regimes like mobile phones.

🌐 My blog post at gdude.de
📝 Paper: Revisiting ResNets: Improved Training and Scaling Strategies
🔨 Code (TensorFlow)

📎 Other references:
EfficientNet
ResNet-D (Bag of Tricks)
RandAugment
AutoAugment
Squeeze-and-Excitation
Future of human-computer interaction: the 10-year vision by Facebook Reality Labs

Say you decide to walk to your local cafe to get some work done. You're wearing a pair of AR glasses and a soft wristband. As you head out the door, your Assistant asks if you'd like to listen to the latest episode of your favorite podcast. A small movement of your finger lets you click "play."

As you enter the cafe, your Assistant asks, "Do you want me to put in an order for a 12-ounce Americano?" Not in the mood for your usual, you again flick your finger to click "no."

You head to a table, but instead of pulling out a laptop, you pull out a pair of soft, lightweight haptic gloves. When you put them on, a virtual screen and keyboard show up in front of you and you begin to edit a document. Typing is just as intuitive as typing on a physical keyboard and you're on a roll, but the noise from the cafe makes it hard to concentrate.

Read more about this vision of the future of HCI in the Facebook Reality Labs (FRL) blog post.
The ultra-low-friction AR interface will be built on two technological pillars:

1. Ultra-low-friction input, so that when you need to act, the path from thought to action is as short and intuitive as possible. You might gesture with your hand, make voice commands, or select items from a menu by looking at them: actions enabled by hand-tracking cameras, a microphone array, and eye-tracking technology.
But ultimately you'll need a more natural way: neural input, e.g., wrist-based electromyography (EMG).
Wrist-based EMG reads the signals of the motor neurons that run from the spinal cord to the hand. The signals at the wrist are so clear that EMG can detect finger motion of just a millimeter. Ultimately it may even be possible to sense just the intent to move a finger.
2. The second pillar is the use of AI, context, and personalization to scope the effects of your input actions to your needs at any given moment. AI should adapt the input interface to the context/environment and, ideally, anticipate the user's needs.

I strongly recommend watching the keynote talk by FRL Chief Scientist Michael Abrash. The FRL projects are very ambitious.
Continuing the discussion about novel human-computer interfaces 🦾

Technologies & Startups that Hack The Brain: Beyond the Healthcare Market
A review of 30 startups, their markets, business models, tech, and where machine learning fits in.

This article takes a rather wide view of neurotech: brain-computer interfaces (BCIs, both invasive and noninvasive) and various technologies, e.g., electroencephalography (EEG), electromyography (EMG), functional near-infrared spectroscopy (fNIRS), and others. It also covers neuromodulation, which partially overlaps with the BCI space.
Gucci and the Belarusian startup Wanna have created virtual sneakers.
You can buy them in the Gucci app for $12 or in the Wanna Kicks app for $9 🤭

I'm not a big fan of such applications. I appreciate the efforts of the Wanna team - they have come a long way since last year, and the shoes fit the foot much better now - but such sneakers still look a bit toyish in my opinion. To make the material look more realistic, one would need to adapt the rendering to the current lighting conditions and shadows.

Would you use this app?

Video from @futuresailors.