Gradient Dude
TL;DR for DL/CV/ML/AI papers from an author of publications at top-tier AI conferences (CVPR, NIPS, ICCV, ECCV).

Most ML feeds go for fluff, we go for the real meat.

YouTube: youtube.com/c/gradientdude
IG: instagram.com/gradientdude
High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
CVPR 2018
https://arxiv.org/abs/1711.11585

What?
Synthesize high-resolution (2048x1024) photo-realistic images from semantic label maps using GANs; applied to street views, indoor scenes, and faces.

Main points:
- Use a ResNet-based architecture for the generator.
- Multi-resolution pipeline: train 2 generators. The first generator G_1 produces a 1024x512 image; the second generator G_2 produces the 2048x1024 image, with the output of the last feature layer of G_1 element-wise summed with the output of one of the intermediate layers of G_2.
After training G_1, they fix it and train G_2. This helps to integrate the global information from G_1 into G_2.
After G_2 is trained, all the networks are jointly fine-tuned together.
- Multi-scale discriminators: 3 discriminators with identical architectures but unshared weights.
Each discriminator operates at a different image scale: the first gets the original image, the second and the third get images downsampled by factors of 2 and 4, respectively. (A minimal sketch of the generator fusion and the multi-scale discriminators follows after this list.)
- LSGAN (Mao et al., 2017) objective function.
- Feature loss based on the features extracted from the layers of the 3 discriminators (in the same spirit as the perceptual loss in Johnson et al., 2016).
- VGG feature loss (Johnson et al., 2016).
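
To make the coarse-to-fine fusion and the multi-scale discrimination concrete, here is a minimal PyTorch sketch. The submodules (g1_features, g2_front, g2_back) and the assumption that both feature maps meet at the same spatial resolution are my own placeholders, not the authors' exact architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoarseToFineGenerator(nn.Module):
    """Fuses the global generator G_1 with the local enhancer G_2 by an
    element-wise sum of feature maps (placeholder submodules)."""
    def __init__(self, g1_features, g2_front, g2_back):
        super().__init__()
        self.g1_features = g1_features  # G_1 up to its last feature layer (runs at 1024x512)
        self.g2_front = g2_front        # G_2 downsampling front end (runs at 2048x1024)
        self.g2_back = g2_back          # G_2 residual blocks + upsampling + output layer

    def forward(self, label_map):
        # G_1 sees the 2x-downsampled label map; G_2 sees the full-resolution one.
        label_small = F.interpolate(label_map, scale_factor=0.5, mode='nearest')
        g1_feat = self.g1_features(label_small)   # global features from G_1
        g2_feat = self.g2_front(label_map)        # local features from G_2 (assumed same size)
        return self.g2_back(g2_feat + g1_feat)    # element-wise sum, then decode

def multiscale_d_outputs(discriminators, image):
    """Run 3 discriminators (identical architecture, unshared weights) on the
    original, 2x- and 4x-downsampled image."""
    outs = []
    for i, d in enumerate(discriminators):
        scaled = F.avg_pool2d(image, kernel_size=2 ** i) if i > 0 else image
        outs.append(d(scaled))
    return outs
```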

💊 Extra trick:
They use not just semantic label maps, but instance-level semantic label maps, which contain a unique object ID for each individual object.
- Train another encoder-decoder network to reconstruct images. Compute encoder features for every instance and use instance-wise average pooling to obtain the average feature for each object instance. This average feature is then broadcast to all pixel locations of the instance. Let E(x) denote the average feature map produced in this way for an input image x.
- When training the generator (G_1 or G_2), it takes not only the semantic label map as input but also E(x), concatenated as extra channels. The generator and E are trained jointly.
- After training, extract E(x) features for all instances in the training images and record them. Perform K-means clustering on these features for each semantic category. Each cluster thus encodes the features of a specific style, for example, the asphalt or cobblestone texture of a road.
- At inference time, randomly pick one of the cluster centers and use it as the encoded features. These features are concatenated with the label map and used as the input to the generator G. (A sketch of the instance-wise pooling and cluster sampling follows after this list.)
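
A hedged sketch of the instance-wise average pooling and the K-means style sampling; the feature shapes and the number of clusters are illustrative assumptions, not the paper's exact settings.

```python
import torch
from sklearn.cluster import KMeans

def instance_average_features(feat, instance_map):
    """Replace each pixel's feature with the mean feature of its instance.
    feat: (C, H, W) encoder output; instance_map: (H, W) integer instance IDs."""
    out = torch.zeros_like(feat)
    for inst_id in instance_map.unique():
        mask = instance_map == inst_id                   # (H, W) boolean mask
        mean = feat[:, mask].mean(dim=1, keepdim=True)   # (C, 1) average feature
        out[:, mask] = mean                              # broadcast to all instance pixels
    return out

def fit_style_clusters(per_instance_feats, n_clusters=10):
    """Cluster the recorded average features of one semantic class (e.g. 'road');
    at inference a cluster center is picked as the style code for that class."""
    return KMeans(n_clusters=n_clusters).fit(per_instance_feats).cluster_centers_
```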

Experiments:
Compared against pix2pix (Isola et al., 2017) and CRN (Chen et al., 2017); the proposed method shows better results than both.
Good ablation studies.

Criticism:
It is not clear whether the feature loss based on the discriminators' features gives any improvement.

📎 Take home message:
A multi-resolution pipeline + multi-scale discriminators work well.
The trick with instance-level semantic label maps enables interactive image editing and captures different modes during training.
Learning Linear Transformations for Fast Arbitrary Style Transfer
https://arxiv.org/abs/1808.04537
Not published but looks like CVPR submission, 2018.

What?
The paper is a follow-up to WCT ("Universal Style Transfer via Feature Transforms", Li et al., NIPS 2017).
The basic idea is to transfer second-order statistics from a style image onto a content image by multiplying the content image features with a transformation matrix. In the WCT paper, the authors computed the transformation matrix with a pre-determined (closed-form) algorithm. In this paper, they propose a neural network that produces the desired transformation matrix from a pair of style and content images.

Short algorithm explanation:
1. Train a VGG-based autoencoder on a dataset of content images.
2. Add a transformation module in the bottleneck, which takes content image features and style image features and produces a transformation matrix.
3. Apply the transformation to the content image features, feed the transformed features to the decoder and obtain the stylization (a sketch of this data flow follows below).
Steps 2 and 3 are trained with the widely used Gram-matrix losses on pretrained VGG features (Gatys et al., 2015).
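
A rough sketch of that data flow in PyTorch. Here encoder, decoder and transform_net stand in for the VGG-based autoencoder and the learned transformation module, and the exact statistics fed to the module (centered feature covariances, style mean added back) are my assumption of the general scheme, not the authors' implementation.

```python
import torch

def stylize(encoder, decoder, transform_net, content_img, style_img):
    fc = encoder(content_img)              # (B, C, Hc, Wc) content features
    fs = encoder(style_img)                # (B, C, Hs, Ws) style features
    B, C = fc.shape[:2]
    fc_flat = fc.view(B, C, -1)
    fs_flat = fs.view(B, C, -1)
    # Center the features and feed their second-order statistics (covariances)
    # to the transformation module, which outputs a CxC matrix T per image.
    fc_centered = fc_flat - fc_flat.mean(dim=2, keepdim=True)
    fs_centered = fs_flat - fs_flat.mean(dim=2, keepdim=True)
    cov_c = fc_centered @ fc_centered.transpose(1, 2) / fc_centered.shape[2]
    cov_s = fs_centered @ fs_centered.transpose(1, 2) / fs_centered.shape[2]
    T = transform_net(cov_c, cov_s)        # (B, C, C) learned linear transform
    # Apply T to the centered content features and re-add the style mean.
    fd = T @ fc_centered + fs_flat.mean(dim=2, keepdim=True)
    return decoder(fd.view_as(fc))
```

Because T is produced in one shot from the feature statistics, a single forward pass suffices even when the training losses cover several VGG layers.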

📢 What is the claimed benefit?
1. Using several different layers of the VGG network to compute the Gram-matrix loss is a commonly adopted technique in style transfer.
With WCT, one would have to do k forward passes through the stylization network to use k different VGG layers.
In contrast, the proposed method learns to model the statistics of those k layers in a single transformation matrix, which is more efficient.
2. After training, a stylization can be produced for any previously unseen style image in a single forward pass.
3. The learned transformation allows the use of a shallower encoder.
4. It is fast: 40 FPS on 512x512 images with the shallow encoder and 28 FPS with a deeper one.
5. More stable frame-wise video stylization, without any temporal context.
6. According to the provided figures, stylization quality is improved compared to WCT and AdaIN (Huang et al., ICCV 2017).

Experiments:
- Standard style transfer experiments, but only a few existing methods are compared against.
- Video stylization.
- Photo-realistic stylization (like day to night).
- Game to real (GTA images to photos).

Criticism:
- Poor comparisons to existing methods. I have a strong impression that the images are cherry-picked to show the cases where the improvement is visible.
- It is doubtful that the method can be applied to images larger than 512x512 px.

🔚 Take home:
Matching second-order statistics between a content image and a style image can be modeled as a linear transformation generated by a learned module. This gives a fast and generalizable approach to style transfer.

🔻 Links:
[1] WCT https://arxiv.org/abs/1705.08086
[2] AdaIN https://arxiv.org/abs/1703.06868
[3] Gatys et al., 2015, https://arxiv.org/abs/1508.06576
Everybody Dance Now
https://arxiv.org/abs/1808.07371
arXiv, 22 Aug 2018 (perhaps submitted to SIGGRAPH)

What?
Given a video of a source person and another of a target person, the method generates a new video of the target person enacting the same motions as the source. This is achieved by means of the Pix2PixHD model + pose estimation + a temporal coherence loss + an extra generator for faces.
Pix2PixHD[1] is "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs", which I described 2 posts earlier.

✏️ Method:
Three-stage approach: pose detection, pose normalization, and mapping from normalized pose stick figures to the target subject.
1. Pose estimation: apply a pretrained pose estimation model (OpenPose [2]) to every frame of the source and target videos. Draw a representation of the pose for every frame as a stickman on a white background. So, for every frame y we have a corresponding stickman image x.
2. Train Pix2PixHD generator G to generate a target person image G(x) given a stickman x as input.
Discriminator D attempts to distinguish between 'real' image pairs (x, y) and 'fake' pairs (x, G(x)).
3. The vanilla Pix2PixHD model works on single frames, but we want temporal coherence between consecutive frames. The authors propose to generate the t-th frame G(x_t) using the corresponding stickman image x_t and the previously generated frame G(x_t-1). In this case the discriminator tries to discern the 'fake' sequence (x_t-1, x_t, G(x_t-1), G(x_t)) from the 'real' sequence (x_t-1, x_t, y_t-1, y_t).
4. To improve the quality of human faces, the authors add a specialized GAN designed to add more detail to the face region. It generates a cropped-out face given the cropped-out head region of the stickman.
After training the full-image generator G, the authors feed the cropped-out generated face and the corresponding region of the stickman to the face generator G_f, which outputs a residual. This residual is then added to the face region of the previously generated full image to improve its realism. (A minimal sketch of steps 3 and 4 follows after this list.)
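
A minimal sketch of how the temporally conditioned generator, the sequence discriminator and the face residual could be wired in PyTorch. The tensor names follow the notation above, while G, D, G_f and the cropping details are placeholders, not the authors' code.

```python
import torch

def temporal_step(G, D, x_prev, x_cur, y_prev, y_cur, g_prev):
    """Tensors involved in one temporally coherent GAN step."""
    # The generator is conditioned on the current stickman and the previous output.
    g_cur = G(torch.cat([x_cur, g_prev], dim=1))
    # The discriminator judges (stickman pair, frame pair) tuples.
    real_score = D(torch.cat([x_prev, x_cur, y_prev, y_cur], dim=1))
    fake_score = D(torch.cat([x_prev, x_cur, g_prev, g_cur], dim=1))
    return g_cur, real_score, fake_score

def refine_face(G_f, full_frame, x_face_crop, box):
    """Add the face generator's residual back into the full generated frame."""
    top, left, h, w = box
    face_crop = full_frame[:, :, top:top + h, left:left + w]
    residual = G_f(torch.cat([x_face_crop, face_crop], dim=1))
    refined = full_frame.clone()
    refined[:, :, top:top + h, left:left + w] = face_crop + residual
    return refined
```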

◼️ Training is done in two stages:
1. Train image generator G and discriminator D, freeze their weights afterward.
2. Train a face generator G_f along with the face discriminator D_f.

◼️ Pose transfer from source video to a target person:
1. The source stickmen are normalized to match the position and scale of the target person's poses (a simplified sketch follows below).
2. Frame by frame, feed the normalized source stickman images to the generators G and G_f to obtain the target person performing the same movements as the source.
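
A simplified sketch of the pose normalization, assuming it reduces to a per-frame scale and translation of the 2D keypoints so that the source skeleton matches the target person's typical height and position; the paper's exact scheme (derived from ankle positions across both videos) may differ in detail.

```python
import numpy as np

def normalize_pose(src_keypoints, src_height, tgt_height, src_anchor, tgt_anchor):
    """src_keypoints: (K, 2) array of 2D joints for one source frame.
    Scale the skeleton to the target person's height and translate it so that
    an anchor joint (e.g. the ankles' midpoint) lands at the target position."""
    scale = tgt_height / src_height
    translation = np.asarray(tgt_anchor) - scale * np.asarray(src_anchor)
    return scale * np.asarray(src_keypoints) + translation
```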

✔️ Experiments:
The authors test their method on dancing videos collected from the internet as sources and on their own videos as targets.

💬 Discussion:
Overall, the method shows compelling results of a target person dancing in the same way as another person does.
But it's not perfect. Self-occlusions of the person are not rendered properly (for example, limbs can disappear).
Target persons were deliberately filmed in tight clothes with minimal wrinkling, since the pose representation does not encode information about clothes, so the method may not work on people wearing arbitrary apparel. Another problem pointed out by the authors is video jitter when the input motion or motion speed differs from the movements seen at training time.

Links:
[1] https://arxiv.org/pdf/1711.11585.pdf
[2] https://github.com/CMU-Perceptual-Computing-Lab/openpose