Gradient Dude

Self-training Improves Pre-training for Natural Language Understanding
Facebook AI & Stanford

Most semi-supervised NLP approaches require in-domain unlabeled data: for the best results, the unlabeled data used for semi-supervised training must come from the same domain as the annotated dataset.

This paper proposes SentAugment, a method that constructs task-specific in-domain unannotated datasets on the fly from a large external bank of sentences. So for any new NLP task where we have only a small dataset, we no longer need to collect a closely matching unannotated dataset in order to use semi-supervised training.
Now we can sort of cheat and improve the performance of an NLP model on almost any downstream task using self-training (also called teacher-student training):
1. We retrieve the most relevant sentences (a few million of them) for the current downstream task from the external bank. For retrieval we use the embedding space of a sentence encoder: a Transformer pre-trained with masked language modeling and finetuned to maximize cosine similarity between similar sentences.
2. We train the teacher model: a RoBERTa-Large model finetuned on the downstream task.
3. Then we use the teacher model to annotate the retrieved unlabeled in-domain sentences. We perform additional filtering by keeping only the sentences with high-confidence predictions.
4. As our student model, we then finetune a new RoBERTa-Large on the synthetic data with a KL-divergence loss, using the teacher's post-softmax class probabilities as labels (i.e., the entire class distribution, not only the most confident class, is used as the label for every sentence).
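
Steps 3 and 4 can be sketched in a few lines of plain Python. This is only a toy illustration, not the authors' code: the logits are made up, the function names are hypothetical, and filtering by a fixed probability threshold of 0.9 is an assumption (the post only says that high-confidence predictions are kept).

```python
import math

def softmax(logits):
    """Turn raw logits into class probabilities."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def filter_confident(sentences, teacher_probs, threshold=0.9):
    """Keep (sentence, probs) pairs whose top teacher probability is high.
    The threshold rule is an assumption made for this sketch."""
    return [(s, p) for s, p in zip(sentences, teacher_probs)
            if max(p) >= threshold]

def kl_divergence(teacher_p, student_p, eps=1e-12):
    """KL(teacher || student) for one sentence; the full teacher
    distribution acts as the soft label."""
    return sum(t * (math.log(t + eps) - math.log(s + eps))
               for t, s in zip(teacher_p, student_p) if t > 0)

# Made-up teacher logits for three retrieved sentences (2 classes).
sentences = ["great movie", "it was a film", "terrible plot"]
teacher_probs = [softmax([4.0, 0.5]), softmax([0.2, 0.1]), softmax([-1.0, 3.5])]

# Step 3: keep only confidently labeled sentences.
synthetic = filter_confident(sentences, teacher_probs, threshold=0.9)

# Step 4: the student is trained to match the teacher's soft labels.
student_probs = softmax([1.0, 0.0])
loss = kl_divergence(synthetic[0][1], student_probs)
```

In a real setup the loss would be averaged over a batch and minimized with respect to the student's weights; here it is just evaluated once to show what is being compared.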

This self-training procedure significantly boosts performance over the baseline, and the gain is larger when fewer ground-truth annotated sentences are available.

As a large-scale external bank of unannotated sentences, the authors use CommonCrawl. In particular, they use a corpus with 5 billion sentences (100B words). Because of its scale and diversity, the sentence bank contains data from various domains and in different styles, which makes it possible to retrieve relevant data for many downstream tasks. To retrieve the most relevant sentences for a specific downstream task, we need to obtain an embedding for the task. Several options exist: (1) average the embeddings of all sentences in the training set; (2) average the embeddings separately for every class; (3) keep the original embedding of every sentence.
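
The three task-embedding options and the cosine-similarity retrieval can be sketched as follows. This is a toy version under stated assumptions: the 2-D embeddings are made up, the mode names and function names are chosen for this example only, and a real system would use an approximate nearest-neighbor index rather than a brute-force sort over billions of sentences.

```python
import math

def mean_vector(vectors):
    """Element-wise average of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def task_queries(train_embs, labels, mode):
    """Build query embeddings for the task (mode names are illustrative)."""
    if mode == "all-average":        # option 1: one query for the whole task
        return [mean_vector(train_embs)]
    if mode == "label-average":      # option 2: one query per class
        by_label = {}
        for emb, label in zip(train_embs, labels):
            by_label.setdefault(label, []).append(emb)
        return [mean_vector(embs) for embs in by_label.values()]
    if mode == "per-sentence":       # option 3: one query per training sentence
        return [list(e) for e in train_embs]
    raise ValueError(f"unknown mode: {mode}")

def retrieve(bank, bank_embs, queries, k):
    """Top-k bank sentences per query by cosine similarity, de-duplicated."""
    selected = []
    for q in queries:
        ranked = sorted(range(len(bank)),
                        key=lambda i: cosine(q, bank_embs[i]), reverse=True)
        for i in ranked[:k]:
            if bank[i] not in selected:
                selected.append(bank[i])
    return selected

# Toy data: made-up 2-D embeddings instead of real sentence-encoder outputs.
train_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = ["pos", "pos", "neg"]
bank = ["loved it", "hated it", "the weather is nice"]
bank_embs = [[1.0, 0.05], [0.0, 1.0], [-1.0, 0.0]]

per_class = retrieve(bank, bank_embs,
                     task_queries(train_embs, labels, "label-average"), k=1)
single = retrieve(bank, bank_embs,
                  task_queries(train_embs, labels, "all-average"), k=1)
```

With one averaged query, minority classes can be drowned out, while per-class or per-sentence queries pull in sentences around each region of the training set, at the cost of more queries.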

📝 Paper
🛠 Code

#paper_explained #nlp


How to easily edit and compose images like in Photoshop using GANs?
MIT

🎯Task:
Given an incomplete image or a collage of images, generate a realistic image from it.

🔑Method:
This paper presents a simple approach: given a fixed pretrained generator (e.g., StyleGAN), the authors train a regressor network to predict the latent code from an input image. To teach the regressor to predict the latent code for images with missing pixels, they mask random patches during training.
Now, given an input collage, the regressor projects it into a reasonable location of the latent space, which the generator then maps onto the image manifold. Such an approach enables more localized editing of individual image parts compared to direct editing in the latent space.
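
The random patch masking used during training can be sketched like this. It is a minimal illustration, assuming square patches, a fixed patch count, and a grayscale image stored as a list of rows; none of these details come from the paper, and `mask_random_patches` is a hypothetical helper name.

```python
import random

def mask_random_patches(image, num_patches=2, patch=2, rng=None):
    """Zero out random square patches of a 2-D image (list of rows).

    Returns the masked image and a binary mask (1 = visible, 0 = hidden).
    Patch size and count are assumptions made for this sketch.
    """
    rng = rng or random.Random()
    h, w = len(image), len(image[0])
    mask = [[1] * w for _ in range(h)]
    for _ in range(num_patches):
        top = rng.randrange(h - patch + 1)
        left = rng.randrange(w - patch + 1)
        for y in range(top, top + patch):
            for x in range(left, left + patch):
                mask[y][x] = 0
    masked = [[px * m for px, m in zip(row, mrow)]
              for row, mrow in zip(image, mask)]
    return masked, mask

# A 6x6 all-ones "image"; hidden pixels become 0.
image = [[1.0] * 6 for _ in range(6)]
masked, mask = mask_random_patches(image, num_patches=2, patch=2,
                                   rng=random.Random(0))
hidden = sum(row.count(0) for row in mask)
```

During training, the regressor would see `masked` while the target stays tied to the full image, so it has to guess a latent code that plausibly fills in the hidden parts.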

📚Interesting findings:
- Even though our regressor is never trained on unrealistic and incoherent collages, it projects the given image into a reasonable latent code.
- The authors show that the representation of the generator is already compositional in the latent code: altering a part of the input image changes the regressed latent code in the corresponding location.

➕Pros:
- As input, we need only a single example of approximately how we want the generated image to look (can be a collage of different images).
- Requires only one forward pass of the regressor and generator -> fast, unlike iterative optimization approaches that can require up to a minute to reconstruct an image. https://arxiv.org/abs/1911.11544
- Does not require any labeled attributes.

💬Applications
- Image inpainting.
- Example-based image editing (incoherent collage -> realistic image).

#paper_explained #cv

📝 Paper: Using latent space regression to analyze and leverage compositionality in GANs
🌐 Project page
⚒ Code
📓 Colab


Learning to resize: Replace a front-end resizer in deep networks by a learnable non-linear resizer
Google Research

Deep computer vision models can benefit greatly from replacing the fixed linear resizer, which is used to downsample ImageNet images before training, with a well-designed, learned, nonlinear resizer.

The structure of the learned resizer is specific: it is not just a matter of adding more generic convolutional layers to the baseline model. It looks like the resizer strives to encode some extra information in the downsampled image, which is where the extra performance on ImageNet comes from.

This work shows that a generically deeper model can be improved upon with a well-designed, task-optimized front-end processor.

Looking ahead: probably there’s a lot of room for work on task-optimized pre-processing modules for computer vision and other tasks.

📝 Paper
No code yet

#cv #paper_explained


🔥New video on my YouTube channel!🔥
I have created a detailed video explanation of the paper "NeX: Real-time View Synthesis with Neural Basis Expansion"

🎯 Task
Given a set of 10-60 photos of a scene, learn a 3D representation of the scene that allows rendering it from novel camera poses.

❓ How?
The proposed approach uses a modification of Multiplane Image (MPI), where it models view-dependent effects by parameterizing each pixel as a linear combination of basis functions learned by a neural network. The pixel representation (i.e., the coordinates in the set of bases defined by the basis functions) depends on the pixel coordinates (x, y, z), but not on the viewing angle. In contrast, the basis functions depend only on the viewing angle and are the same for every pixel if the angle is fixed. Decoupling the angle from the coordinates in this way allows caching all pixel representations, which results in a 100x speedup of novel scene rendering (60FPS!). Moreover, the proposed scene parametrization allows the rendering of specular (non-Lambertian) objects with complex view-dependent effects.

✏️ Detailed approach summary
A multiplane image is a 3D scene representation that consists of a collection of D planar images, each with dimensions H × W × 4, where the last dimension contains RGB values and alpha transparency values. These planes are scaled and placed equidistantly either in the depth space (for bounded close-up objects) or in the inverse depth space (for scenes that extend out to infinity) along a reference viewing frustum.
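
To get a feel for how D RGBA planes turn into one image, here is a sketch of the standard back-to-front "over" alpha compositing for a single pixel. The post does not spell out the compositing step, so treat this as background: the function name is hypothetical, and warping the planes into the target camera view is omitted.

```python
def composite_pixel(planes):
    """Blend one pixel across D planes with the 'over' operator.

    planes: list of (r, g, b, alpha) tuples ordered from the farthest
    plane to the nearest one. Warping planes into the target view is
    not modeled in this sketch.
    """
    color = [0.0, 0.0, 0.0]
    for r, g, b, a in planes:
        # The nearer plane covers what is behind it in proportion to alpha.
        color = [a * new + (1.0 - a) * old
                 for new, old in zip((r, g, b), color)]
    return color

# An opaque blue plane behind a half-transparent red plane.
half_red_over_blue = composite_pixel([(0.0, 0.0, 1.0, 1.0),
                                      (1.0, 0.0, 0.0, 0.5)])

# A fully opaque near plane hides everything behind it.
green_hides_blue = composite_pixel([(0.0, 0.0, 1.0, 1.0),
                                    (0.0, 1.0, 0.0, 1.0)])
```

Note that the colors here are fixed per plane pixel, which is exactly why a plain MPI cannot change appearance with the viewing angle.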

One main limitation of MPI is that it can only model diffuse (Lambertian) surfaces, whose colors appear constant regardless of the viewing angle. In real-world scenes, many objects are non-Lambertian, such as a ceramic plate, a glass table, or a metal wrench.

Regressing the color directly from the viewing angle v (and the pixel location [x, y, z]) with a neural network F(x, y, z, v), as is done in NeRF, is very inefficient for real-time rendering, as it requires recomputing every voxel in the volume for every new camera pose.

The key idea of the NeX method is to approximate this function F(x, y, z, v) with a linear combination of learnable basis functions {H_n(v): R^2 → R^{3x3}}.

To summarize, the modified MPI contains the following parameters per pixel: α, k_0, k_1, ..., k_N. These parameters are predicted by a neural network f(x, y, z) for every pixel.

Another set of parameters is the global basis matrices H_1(v), H_2(v), ..., H_N(v), which are shared across all pixels but depend on the viewing angle v. The columns of H_n(v) are basis vectors of some color space different from RGB space. These basis matrices are predicted by another neural network g(v) = [H_1(v), H_2(v), ..., H_N(v)].

The motivation for using the second network is to ensure that the prediction of the basis functions is independent of the voxel coordinates. This allows precomputing and caching the output of f(x, y, z) for all coordinates. Therefore, a novel view can be synthesized with just a single forward pass of network g(v), because f() does not depend on v and we don't need to recompute it.
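
The caching trick can be sketched with toy stand-ins for the two networks. Everything concrete here is an assumption for illustration: f and g return hand-picked values instead of network outputs, N = 2, v is taken as a pair of angles, and the pixel color is read as k_0 + Σ_n H_n(v) k_n with k_n as 3-vectors, following the 3x3 matrix notation used in this post.

```python
import math

f_calls = 0  # counts evaluations of the per-pixel network

def f(x, y, z):
    """Toy stand-in for the per-pixel network: alpha and k_0..k_N (N = 2)."""
    global f_calls
    f_calls += 1
    alpha = 0.8
    k = [[0.2, 0.3, 0.4],   # k_0: view-independent base color
         [0.1, 0.0, 0.0],   # k_1
         [0.0, 0.1, 0.0]]   # k_2
    return alpha, k

def scaled_identity(s):
    return [[s if i == j else 0.0 for j in range(3)] for i in range(3)]

def g(v):
    """Toy stand-in for the basis network: N 3x3 matrices from v only."""
    theta, phi = v
    return [scaled_identity(math.cos(theta)), scaled_identity(math.sin(phi))]

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def pixel_color(k, H):
    """color = k_0 + sum_n H_n(v) k_n"""
    color = list(k[0])
    for H_n, k_n in zip(H, k[1:]):
        color = [c + d for c, d in zip(color, matvec(H_n, k_n))]
    return color

# View-independent part: evaluated once per pixel and cached.
coords = [(0, 0, 0), (1, 0, 0)]
cache = {p: f(*p) for p in coords}

def render(cache, v):
    """Render a new view: one call to g(v), no calls to f."""
    H = g(v)
    return {p: (alpha, pixel_color(k, H)) for p, (alpha, k) in cache.items()}

view_a = render(cache, (0.0, 0.0))
view_b = render(cache, (math.pi / 2, math.pi / 2))
```

Rendering both views leaves `f_calls` at 2 (one per cached pixel): changing the camera only re-evaluates g(v) and the cheap linear combination.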

Compared with NeRF, the proposed MPI can be thought of as a discretized sampling of an implicit radiance field function that is decoupled into view-dependent basis functions H_n(v) and view-independent parameters α and k_n, n = 1...N.

▶️ Video explanation
🌐 NeX project page
📝 NeX paper
⏱ Realtime demo

💠 Multiplane Images (MPI)
💠 NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
#paper_explained #cv #video_exp
