How to easily edit and compose images like in Photoshop using GANs?
MIT
Task:
Given an incomplete image or a collage of images, generate a realistic image from it.
Method:
This paper presents a simple approach: given a fixed pretrained generator (e.g., StyleGAN), the authors train a regressor network to predict the latent code from an input image. To teach the regressor to predict latent codes for images with missing pixels, they mask random patches during training. Given an input collage, the regressor then projects it into a reasonable location of the latent space, which the generator maps onto the image manifold. This enables more localized editing of individual image parts than direct editing in the latent space (a minimal training sketch is below).
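A minimal PyTorch-style sketch of this training setup, assuming a frozen pretrained generator G and a convolutional regressor E; the masking scheme and loss here are illustrative, not the authors' exact recipe:

```python
# Minimal sketch of the masked latent-regression training step. G is a frozen
# pretrained generator (e.g., StyleGAN) mapping a latent code to an image;
# E is the regressor being trained. Names and losses are illustrative.
import torch
import torch.nn.functional as F

def train_step(E, G, images, optimizer, mask_frac=0.4):
    b, c, h, w = images.shape

    # Mask a random rectangular patch per image so E learns to predict a
    # plausible latent code even when pixels are missing.
    masked = images.clone()
    ph, pw = int(h * mask_frac), int(w * mask_frac)
    for i in range(b):
        y0 = torch.randint(0, h - ph + 1, (1,)).item()
        x0 = torch.randint(0, w - pw + 1, (1,)).item()
        masked[i, :, y0:y0 + ph, x0:x0 + pw] = 0.0

    w_pred = E(masked)      # predicted latent code
    recon = G(w_pred)       # G stays frozen: only E's params are in the optimizer

    # Reconstruct the unmasked target; a perceptual term (e.g., LPIPS) is
    # typically added on top of the pixel loss.
    loss = F.mse_loss(recon, images)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At test time, editing is a single pass: realistic_image = G(E(collage)).
```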
Interesting findings:
- Even though the regressor is never trained on unrealistic, incoherent collages, it still projects such images into a reasonable latent code.
- The authors show that the generator's representation is already compositional in the latent code: altering a part of the input image changes the regressed latent code in the corresponding location.
Pros:
- As input, we need only a single example of approximately how we want the generated image to look (can be a collage of different images).
- Requires only one forward pass of the regressor and the generator -> fast, unlike iterative optimization approaches, which can take up to a minute to reconstruct an image (https://arxiv.org/abs/1911.11544).
- Does not require any labeled attributes.
Applications:
- Image inpainting.
- Example-based image editing (incoherent collage -> realistic image).
#paper_explained #cv
Paper: Using latent space regression to analyze and leverage compositionality in GANs
Project page
Code
Colab
Learning to resize: replace the fixed front-end resizer in deep networks with a learnable non-linear resizer
Google Research
Deep computer vision models can benefit greatly from replacing the fixed linear resizer used to downsample ImageNet images before training with a well-designed, learned, non-linear resizer.
The structure of the learned resizer is specific; it is not just more generic convolutional layers added to the baseline model. It appears to encode extra information into the downsampled image, and this is where the extra performance on ImageNet comes from.
This work shows that a generically deeper model can be improved upon with a well-designed, task-optimized front-end processor (a rough sketch of such a resizer is below).
Looking ahead: there is probably a lot of room for work on task-optimized pre-processing modules for computer vision and other tasks.
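A rough PyTorch sketch of what such a learnable front-end resizer could look like: a fixed bilinear skip connection plus a small residual CNN, trained jointly with the downstream model (layer sizes here are made up, not the paper's exact architecture):

```python
# Rough sketch of a learnable front-end resizer: a bilinear skip connection
# plus a small residual CNN, trained jointly with the downstream classifier.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnableResizer(nn.Module):
    def __init__(self, out_size=(224, 224), channels=16):
        super().__init__()
        self.out_size = out_size
        self.head = nn.Sequential(
            nn.Conv2d(3, channels, 7, padding=3), nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 1), nn.LeakyReLU(0.2),
            nn.BatchNorm2d(channels),
        )
        self.res = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.2),
            nn.Conv2d(channels, channels, 3, padding=1), nn.BatchNorm2d(channels),
        )
        self.out = nn.Conv2d(channels, 3, 7, padding=3)

    def forward(self, x):
        # Fixed bilinear path keeps a sane baseline; the CNN path learns what
        # extra information to squeeze into the low-resolution image.
        skip = F.interpolate(x, size=self.out_size, mode='bilinear',
                             align_corners=False)
        feat = self.head(x)
        feat = F.interpolate(feat, size=self.out_size, mode='bilinear',
                             align_corners=False)
        feat = feat + self.res(feat)
        return skip + self.out(feat)

# e.g. nn.Sequential(LearnableResizer(), torchvision.models.resnet50())
# would then be trained end-to-end on the classification loss.
```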
Paper
No code yet
#cv #paper_explained
New video on my YouTube channel!
I have created a detailed video explanation of the paper "NeX: Real-time View Synthesis with Neural Basis Expansion"
Task
Given a set of photos of a scene (10-60 photos), learn a 3D representation of the scene that allows rendering it from novel camera poses.
How?
The proposed approach uses a modification of the Multiplane Image (MPI) that models view-dependent effects by parameterizing each pixel as a linear combination of basis functions learned by a neural network. The pixel representation (i.e., the coordinates in the set of bases defined by the basis functions) depends on the pixel coordinates (x, y, z) but not on the viewing angle. In contrast, the basis functions depend only on the viewing angle and are the same for every pixel once the angle is fixed. This decoupling of angle and coordinates allows all pixel representations to be cached, which results in a ~100x speedup of novel-view rendering (60 FPS!). Moreover, the proposed scene parameterization allows rendering specular (non-Lambertian) objects with complex view-dependent effects.

Detailed approach summary
A multiplane image is a 3D scene representation that consists of a collection of D planar images, each with dimensions H × W × 4, where the last dimension contains RGB values and an alpha transparency value. These planes are scaled and placed equidistantly either in depth space (for bounded close-up objects) or in inverse depth space (for scenes that extend out to infinity) along a reference viewing frustum. One main limitation of the MPI is that it can only model diffuse or Lambertian surfaces, whose colors appear constant regardless of the viewing angle. In real-world scenes, many objects are non-Lambertian, such as a ceramic plate, a glass table, or a metal wrench.

Regressing the color directly from the viewing angle v (and the pixel location (x, y, z)) with a neural network F(x, y, z, v), as is done in NeRF, is very inefficient for real-time rendering because every voxel in the volume has to be recomputed for every new camera pose. The key idea of the NeX method is to approximate this function F(x, y, z, v) with a linear combination of learnable basis functions {H_n(v): R^2 → R^{3x3}}.

To summarize, the modified MPI contains the following parameters per pixel: α, k_0, k_1, ..., k_N, predicted by a neural network f(x, y, z) for every pixel. Another set of parameters, the global basis matrices H_1(v), H_2(v), ..., H_N(v), is shared across all pixels but depends on the viewing angle v. The columns of H_n(v) are basis vectors of a color space different from the RGB space. These basis matrices are predicted by another neural network g(v) = [H_1(v), H_2(v), ..., H_N(v)].

The motivation for using the second network is to make the prediction of the basis functions independent of the voxel coordinates. The output of f(x, y, z) can therefore be precomputed and cached for all coordinates, so a novel view can be synthesized with just a single forward pass of g(v), because f() does not depend on v and does not need to be recomputed.

Compared with NeRF, the proposed MPI can be thought of as a discretized sampling of an implicit radiance field function that is decoupled into view-dependent basis functions H_n(v) and view-independent parameters α and k_n, n = 1...N.
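A toy sketch of the resulting per-pixel color model, following the shapes described above (purely illustrative, with hypothetical variable names):

```python
# Toy sketch of the NeX color model for one MPI pixel: the view-independent
# coefficients k_0..k_N come from the cached network f(x, y, z), while the
# view-dependent basis matrices H_1(v)..H_N(v) come from g(v), evaluated once
# per camera pose. Shapes follow the summary above and are illustrative.
import torch

def pixel_color(k, H):
    """
    k: (N + 1, 3)  cached per-pixel coefficients [k_0, k_1, ..., k_N]
    H: (N, 3, 3)   global basis matrices for the current viewing angle v
    Returns the RGB color k_0 + sum_n H_n(v) @ k_n.
    """
    base = k[0]                                      # k_0: view-independent color
    view_dep = torch.einsum('nij,nj->i', H, k[1:])   # sum over the N basis terms
    return base + view_dep

# Rendering a new view needs only one forward pass of g(v) to get H; the
# per-pixel alpha and k_n are precomputed, and the MPI planes are then
# alpha-composited as usual.
```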
Video explanation
NeX project page
NeX paper
Realtime demo
Multiplane Images (MPI)
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
#paper_explained #cv #video_exp
Forwarded from Data Science by ODS.ai
EfficientNetV2: Smaller Models and Faster Training
A new paper from Google Brain introduces a new SOTA architecture called EfficientNetV2. The authors develop a family of CNN models that are optimized both for accuracy and training speed. The main improvements are:
- an improved training-aware neural architecture search with new building blocks and ideas to jointly optimize training speed and parameter efficiency;
- a new approach to progressive learning that adjusts regularization along with the image size (a toy sketch of this schedule is below).
As a result, the new approach reaches SOTA results while training up to 11x faster and using up to 6.8x fewer parameters.
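A toy sketch of the progressive-learning idea, with made-up numbers: image size and regularization strength (dropout, RandAugment magnitude) grow together over training stages:

```python
# Sketch of the progressive-learning idea: image size and regularization
# strength (dropout, RandAugment magnitude) are increased together over
# training stages. The values are made up for illustration.
def progressive_schedule(stage, num_stages=4,
                         size_range=(128, 300),
                         dropout_range=(0.1, 0.3),
                         randaug_range=(5, 15)):
    t = stage / (num_stages - 1)            # 0.0 -> 1.0 across stages
    interp = lambda lo, hi: lo + t * (hi - lo)
    return {
        "image_size": int(interp(*size_range)),
        "dropout": interp(*dropout_range),
        "randaug_magnitude": interp(*randaug_range),
    }

for stage in range(4):
    cfg = progressive_schedule(stage)
    # e.g. {'image_size': 128, 'dropout': 0.1, ...} in the first stage;
    # the dataloader and model regularizers would be rebuilt with cfg here.
    print(cfg)
```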
Paper: https://arxiv.org/abs/2104.00298
Code will be available here:
https://github.com/google/automl/efficientnetv2
A detailed unofficial overview of the paper: https://andlukyane.com/blog/paper-review-effnetv2
#cv #sota #nas #deeplearning
Self-supervised Learning for Medical images
Due to fixed imaging procedures, medical images like X-ray or CT scans are usually well aligned geometrically.
This makes it possible to exploit the alignment to automatically mine similar pairs of image patches for self-supervised training.
The basic idea is to fix K random locations in the unlabeled medical images (the same K locations for every image) and crop patches at these locations across different images, i.e., scans of different patients (a small sketch of this mining step is below).
A surrogate classification task is then created by assigning a unique pseudo-label 1...K to every location.
The authors combine the surrogate classification task with image restoration using a denoising autoencoder: they randomly perturb the cropped patches (color jittering, random noise, random cut-outs) and train a decoder to restore the original view.
However, the alignment between medical images is sometimes imperfect, and images may depict different body parts. To make sure that the images are aligned, they train an autoencoder on the full images (before cropping) and keep only similar images, judged by the distances between them in the learned autoencoder latent space.
The authors show that their method is significantly better than other self-supervised learning approaches on medical data and can even be combined with existing self-supervised methods like RotNet (predicting image rotations). Unfortunately, the comparison is rather limited: there is no comparison to Jigsaw Puzzle, SwAV, or recent contrastive self-supervised methods like MoCo, BYOL, and SimCLR.
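A small sketch of the patch-mining step under the alignment assumption (array shapes and helper names are hypothetical):

```python
# Sketch of the patch-mining step, assuming roughly aligned scans: K crop
# locations are drawn once and reused for every image, and the location index
# serves as the pseudo-label for the surrogate classification task.
import numpy as np

def sample_locations(K, img_hw, patch_hw, seed=0):
    rng = np.random.default_rng(seed)
    h, w = img_hw
    ph, pw = patch_hw
    ys = rng.integers(0, h - ph, size=K)
    xs = rng.integers(0, w - pw, size=K)
    return list(zip(ys, xs))                # same K locations for all patients

def make_pseudo_labeled_patches(images, locations, patch_hw):
    ph, pw = patch_hw
    patches, labels = [], []
    for img in images:                      # one scan per patient
        for label, (y, x) in enumerate(locations):
            patches.append(img[y:y + ph, x:x + pw])
            labels.append(label)            # pseudo-label = location index
    return np.stack(patches), np.array(labels)

# A classifier is then trained to predict the location index from a randomly
# perturbed patch, alongside a decoder that restores the unperturbed patch.
```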
Paper
Code & Models
#paper_tldr #cv #self_supervised
LatentCLR: A Contrastive Learning Approach for Unsupervised Discovery of Interpretable Directions
A framework that learns meaningful directions in GANs' latent space using unsupervised contrastive learning. Instead of discovering fixed directions as in previous work, this method can discover non-linear directions in pretrained StyleGAN2 and BigGAN models. The discovered directions can be used for image manipulation.
The authors use the differences caused by an edit operation on the feature activations to optimize the identifiability of each direction. The edit operations are modeled by several separate, learned neural networks ∆_i(z). Given a latent code z and its generated image x = G(z), the goal is to find edit operations ∆_i(z) such that the image x' = G(∆_i(z)) has semantically meaningful changes over x while still preserving the identity of x (a loose sketch of the contrastive objective is below).
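A loose sketch of a contrastive objective over edit directions in this spirit, assuming K direction networks, a frozen generator G, and an intermediate feature extractor feats (an illustration of the idea, not the authors' implementation):

```python
# Loose sketch: feature differences caused by the same direction (across
# different latents) are pulled together, while differences caused by other
# directions are pushed apart, in an InfoNCE-style grouping.
import torch
import torch.nn.functional as F

def latentclr_loss(z, G, feats, directions, tau=0.5):
    base = feats(G(z))                                      # (B, D) features of x = G(z)
    diffs = [feats(G(d(z))) - base for d in directions]     # K feature differences
    h = F.normalize(torch.stack(diffs, dim=1), dim=-1)      # (B, K, D)

    B, K, D = h.shape
    flat = h.reshape(B * K, D)
    sim = flat @ flat.t() / tau                              # cosine similarities
    labels = torch.arange(K).repeat(B).to(z.device)          # direction id per row

    # Positives: same direction applied to different latents; negatives: other
    # directions. Self-similarity is masked out.
    pos = (labels[:, None] == labels[None, :]).float()
    pos.fill_diagonal_(0)
    logits = sim - torch.eye(B * K, device=z.device) * 1e9
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    return -(pos * log_prob).sum(1).div(pos.sum(1).clamp(min=1)).mean()
```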
Paper
Code (next week)
#paper_tldr #cv #gan