High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
CVPR 2018
https://arxiv.org/abs/1711.11585
What?
Synthesize high-resolution (2048x1024) photo-realistic images from semantic label maps using GANs. Applied to street views, indoor scenes, and faces.
Main points:
- A ResNet-based architecture is used for the generator.
- Multi-resolution pipeline: train 2 generators. The first generator G_1 produced 1024x512 image. The second generator G_2 produces 2048x1024 image, but the output of the last feature layer of G_1 is element-wise summed with the output of one of the intermediate layers of G_2.
After training of G_1, they fix it and train G_2. This helps to integrate the global information from G_1 to G_2.
After G_2 is trained they jointly fine-tune all the networks together.
- Multi-scale discriminators. They use 3 discriminators with identical architectures but unshared weights.
Each discriminator operates at a different image scale: the first gets the original image, the second and the third get images downsampled by factors of 2 and 4, respectively (see the sketch after this list).
- LSGAN (Mao et al., 2017) objective function.
- Feature matching loss based on features extracted from the layers of the 3 discriminators (in the same spirit as the perceptual loss of Johnson et al., 2016).
- VGG feature loss (Johnson et al., 2016).
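A minimal PyTorch sketch of the multi-scale discriminator setup with the LSGAN objective (the architecture, function names, and loss weighting here are my own simplifications; the actual pix2pixHD discriminators are deeper and also expose intermediate features for the feature-matching loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_patch_discriminator(in_ch):
    # Simplified PatchGAN-style discriminator; the real pix2pixHD one is deeper
    # and also returns intermediate features for the feature-matching loss.
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(128, 1, 4, stride=1, padding=1),   # per-patch real/fake score map
    )

discriminators = nn.ModuleList([make_patch_discriminator(3) for _ in range(3)])

def multiscale_outputs(img):
    # Run the 3 discriminators on an image pyramid: full, /2, and /4 resolution.
    outs = []
    for i, D in enumerate(discriminators):
        outs.append(D(img))
        if i < len(discriminators) - 1:
            img = F.avg_pool2d(img, kernel_size=3, stride=2, padding=1)
    return outs

def lsgan_d_loss(real_img, fake_img):
    # LSGAN objective: push real patches toward 1 and fake patches toward 0 (least squares).
    loss = 0.0
    for out_r, out_f in zip(multiscale_outputs(real_img), multiscale_outputs(fake_img.detach())):
        loss = loss + F.mse_loss(out_r, torch.ones_like(out_r)) \
                    + F.mse_loss(out_f, torch.zeros_like(out_f))
    return loss
```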
Extra trick:
They use not just semantic label maps, but instance-level semantic label maps, which contain a unique object ID for each individual object.
- Train another encoder-decoder network to reconstruct images. Compute encoder features for every instance and use instance-wise average pooling to compute the average feature for that object instance. The average feature is then broadcast to all pixel locations of the instance (see the sketch after this list). Denote by E(x) the average feature map produced in this way for input image x.
- When training the generator (G_1 or G_2), use not only the semantic label map as input but also E(x), concatenated to it as extra channels. The generator and E are trained jointly.
- After training, extract E(x) features for all instances in the training images and record them. Perform K-means clustering on these features for each semantic category. Each cluster thus encodes the features of a specific style, for example the asphalt or cobblestone texture of a road.
- At inference time, randomly pick one of the cluster centers and use it as the encoded features. These features are concatenated with the label map and used as the input to the generator G.
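A minimal sketch of the instance-wise average pooling that produces E(x) (tensor shapes and names are my assumptions):

```python
import torch

def instance_average_pool(feat, inst_map):
    # feat: (C, H, W) encoder features; inst_map: (H, W) integer instance IDs.
    out = torch.zeros_like(feat)
    for inst_id in inst_map.unique():
        mask = inst_map == inst_id                              # pixels of this instance
        out[:, mask] = feat[:, mask].mean(dim=1, keepdim=True)  # broadcast the mean feature
    return out  # E(x): same spatial size as feat, piecewise-constant per instance
```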
Experiments:
Compared to pix2pix (Isola et al., 2017) and CRN (Chen and Koltun, 2017); the proposed method shows better results.
Good ablation studies.
Criticism:
It is not clear whether the feature loss based on the discriminators' features gives any improvement.
Take-home message:
A multi-resolution generator pipeline + multi-scale discriminators work well.
The trick with instance-level semantic label maps enables interactive image editing and helps capture different modes during training.
Learning Linear Transformations for Fast Arbitrary Style Transfer
https://arxiv.org/abs/1808.04537
Not published but looks like CVPR submission, 2018.
What?
The paper is a follow-up to WCT ("Universal Style Transfer via Feature Transforms", Li et al., NIPS 2017).
The basic idea is to transfer second-order statistics from a style image onto a content image via a multiplication between content image features and a transformation matrix. In the WCT paper, the authors compute the transformation matrix with a pre-determined algorithm (whitening and coloring transforms). In this paper, they propose to use a neural network to produce the desired transformation matrix from a pair of style and content images.
Short algorithm explanation:
1. Train a VGG-based autoencoder on a dataset of content images.
2. Add a transformation module in the bottleneck, which takes content image features and style image features and produces a transformation matrix.
3. Apply the transformation to the content image features, feed the transformed features to the decoder, and get a stylization.
Steps 2 and 3 are trained with the widely used Gram-matrix losses on pretrained VGG features (Gatys et al., 2015), as sketched below.
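A rough sketch of the core linear transformation and the Gram-matrix loss (shapes and names are my assumptions; the module that actually predicts T from the content/style feature pair is a small network whose details I omit):

```python
import torch
import torch.nn.functional as F

def apply_transform(content_feat, style_feat, T):
    # content_feat: (C, H, W), style_feat: (C, Hs, Ws), T: (C, C) produced by the module.
    c = content_feat.flatten(1)
    c = c - c.mean(dim=1, keepdim=True)                  # center the content features
    s_mean = style_feat.flatten(1).mean(dim=1, keepdim=True)
    out = T @ c + s_mean                                 # transfer 2nd-order stats, add the style mean
    return out.view_as(content_feat)                     # fed to the decoder

def gram(feat):                                          # (C, H, W) -> (C, C)
    f = feat.flatten(1)
    return f @ f.t() / f.shape[1]

def style_loss(vgg_feats_out, vgg_feats_style):
    # Gram-matrix loss over several VGG layers (Gatys et al., 2015).
    return sum(F.mse_loss(gram(a), gram(b)) for a, b in zip(vgg_feats_out, vgg_feats_style))
```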
What is the claimed benefit?
1. Using several different layers of a VGG network to compute the Gram-matrix loss is the commonly adopted technique in style transfer.
With WCT, one would have to do k forward passes through the stylization network to use k different VGG layers.
In contrast, the proposed method learns to model the statistics of those k layers in a single transformation matrix, which is more efficient.
2. After training, a stylization can be produced for any previously unseen style image in a single forward pass.
3. The learned transformation allows using a shallower encoder.
4. It is fast: 40 FPS on 512x512 images with the shallow encoder and 28 FPS with a deeper one.
5. More stable frame-wise video stylization, without any temporal context.
6. According to the provided figures, stylization quality is improved compared to WCT and AdaIN (Huang et al., ICCV 2017).
Experiments:
- Standard style transfer experiments, but only a few methods are compared against.
- Video stylization.
- Photo-realistic stylization (like day to night).
- Game to real (GTA images to photos).
Criticism:
- Weak comparisons to existing methods. I have a strong impression that the images are cherry-picked to show the cases where improvement is visible.
- It is doubtful that the method can be applied to images larger than 512x512 px.
Take home:
Matching second-order statistics between a content image and a style image can be modeled by a linear transformation whose matrix is generated by a learned module. This gives a fast and generalizable approach to style transfer.
Links:
[1] WCT: https://arxiv.org/abs/1705.08086
[2] AdaIN: https://arxiv.org/abs/1703.06868
[3] Gatys et al., 2015: https://arxiv.org/abs/1508.06576
Everybody Dance Now
https://arxiv.org/abs/1808.07371
Arxiv, 22 Aug 2018 (perhaps submitted to SIGGRAPH)
What?
Given a video of a source person and another of a target person, the method can generate a new video of the target person enacting the same motions as the source. This is achieved by means of a Pix2PixHD model + pose estimation + a temporal coherence loss + an extra generator for faces.
Pix2PixHD[1] is "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs", which I described 2 posts earlier.
Method:
Three-stage approach: pose detection, pose normalization, and mapping from normalized pose stick figures to the target subject.
1. Pose estimation: apply a pretrained pose estimation model (OpenPose[2]) to every frame of the input and output videos, and draw the pose for every frame as a stickman on a white background. So, for every frame y we have a corresponding stickman image x.
2. Train a Pix2PixHD generator G to generate a target-person image G(x) given a stickman x as input. The discriminator D attempts to distinguish between 'real' image pairs (x, y) and 'fake' pairs (x, G(x)).
3. The vanilla Pix2PixHD model works on single frames, but we want temporal coherence between consecutive frames. The authors propose to generate the t-th frame G(x_t) using the corresponding stickman image x_t and the previously generated frame G(x_{t-1}). The discriminator then tries to discern a 'fake' sequence (x_{t-1}, x_t, G(x_{t-1}), G(x_t)) from a 'real' sequence (x_{t-1}, x_t, y_{t-1}, y_t); a minimal sketch of this setup follows the list.
4. To improve the quality of human faces, the authors add a specialized GAN designed to add more detail to the face region. It generates a cropped-out face given the cropped-out head region of the stickman. After training the full image generator G, they feed a cropped-out face and the corresponding region of the stickman to the face generator G_f, which outputs a residual. This residual is then added to the previously generated full image to improve face realism.
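A minimal sketch of the temporally conditioned setup from step 3 (tensor names and the concatenation layout are my assumptions):

```python
import torch

# x_prev, x_t: stickman images; y_prev, y_t: real frames; all (N, 3, H, W).
def temporal_step(G, D, x_prev, x_t, y_prev, y_t):
    zero_frame = torch.zeros_like(y_t)                        # placeholder "previous frame" at t = 0
    g_prev = G(torch.cat([x_prev, zero_frame], dim=1))        # generate frame t-1
    g_t    = G(torch.cat([x_t, g_prev], dim=1))               # condition on the generated t-1 frame
    d_fake = D(torch.cat([x_prev, x_t, g_prev, g_t], dim=1))  # 'fake' sequence
    d_real = D(torch.cat([x_prev, x_t, y_prev, y_t], dim=1))  # 'real' sequence
    return d_real, d_fake
```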
Training is done in two stages:
1. Train the image generator G and the discriminator D, then freeze their weights.
2. Train the face generator G_f along with the face discriminator D_f.
Pose transfer from the source video to the target person:
1. Source stickmen are normalized to match the position and scale of the target person's poses (a toy version of this step is sketched below).
2. Frame by frame, feed the normalized source stickman images to the generators G and G_f to get the target person performing the same movements as the source.
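A toy version of the normalization in step 1, assuming 2D keypoints are available for both subjects (the paper's actual scheme is more careful; the statistics used here are hypothetical):

```python
import numpy as np

def normalize_pose(src_kpts, src_span, tgt_span):
    # src_kpts: (J, 2) source keypoints as (x, y); *_span: (y_top, y_bottom) giving the
    # typical vertical extent of the person in the source / target videos (hypothetical stats).
    scale = (tgt_span[1] - tgt_span[0]) / (src_span[1] - src_span[0])
    out = src_kpts.astype(np.float64).copy()
    out[:, 1] = (out[:, 1] - src_span[0]) * scale + tgt_span[0]   # match height and vertical position
    x_center = out[:, 0].mean()
    out[:, 0] = (out[:, 0] - x_center) * scale + x_center         # scale width about the body center
    return out
```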
Experiments:
The authors test their method on dancing videos collected from the internet as sources and on their own videos as targets.
Discussion:
Overall, the method shows compelling results of a target person dancing in the same way as another person.
But it is not perfect. Self-occlusions of the person are not rendered properly (for example, limbs can disappear).
Target persons were deliberately filmed in tight clothes with minimal wrinkling since the pose representation does not encode information about clothes. So it may not work on people wearing arbitrary apparel. Another problem pointed out by the authors is video jittering when the input motion or motion speed is different from the movements seen at training time.
Links:
[1] https://arxiv.org/pdf/1711.11585.pdf
[2] https://github.com/CMU-Perceptual-Computing-Lab/openpose
I'm attending ECCV right now and must share this with you.
Great talk from Erik Learned-Miller on unsupervised learning using depth prediction as a surrogate task. The paper: https://t.co/BeaSLoQAPh
Improving a detector's performance by unsupervised hard example mining.
Paper: https://t.co/3P4NX2dds5
X2Face: A network for controlling face generation using images, audio, and pose codes
Olivia Wiles*, A. Sophia Koepke*, Andrew Zisserman
https://arxiv.org/abs/1807.10550
ECCV 2018
Briefly
The authors propose a model that can control a source face given a driving face, producing a generated frame with the same identity as the source face but the pose and expression of the driving face. The model is trained in a self-supervised manner on a large collection of video data. They also show that the generation process can be driven by audio (a person speaking) or by pose codes without further network training. No 3D models are used; the method works on 2D frames.
Method:
The model consists of 2 networks: an embedding net and a driving net.
1. Given a source frame, the embedding net predicts a vector field which, applied to the source frame, gives a so-called embedded face (which is essentially a frontalized face).
2. Given a driving frame, the driving net predicts a vector field (i.e., an x,y shift for every pixel of the input image). This vector field transforms pixels of the embedded face to produce the generated frame.
The driving net has an encoder-decoder architecture (U-Net and pix2pix based). The latent space of this network (the bottleneck features, called the driving vector) encodes pose, expression, zoom, and other factors of variation.
Note: the trick is to avoid translating the input image into the output image directly at the pixel level (as, for example, CycleGAN does) and instead to predict a vector field that transforms the input image into something new. This allows training without a discriminator and without explicit pose/expression labels.
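A minimal sketch of how a predicted per-pixel vector field can warp an image via bilinear sampling (my own implementation; X2Face's exact sampling convention may differ):

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    # image: (N, C, H, W); flow: (N, 2, H, W) predicted (x, y) pixel offsets.
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys]).float().to(image.device)   # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow                        # where each output pixel samples from
    coords_x = 2 * coords[:, 0] / (w - 1) - 1                # normalize to [-1, 1] for grid_sample
    coords_y = 2 * coords[:, 1] / (h - 1) - 1
    grid = torch.stack([coords_x, coords_y], dim=-1)         # (N, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)
```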
Training is done in two steps:
First step:
The authors sample two random frames from the same video (same person). One is used as the source frame, the other as the driving frame. In this case, the loss function is a photometric L1 loss between the driving frame and the generated frame (since the source frame has the same identity as the driving frame).
Second step:
A new loss function is introduced: an identity loss, which compares frames using features of a pre-trained person-identification network (VGG-11).
The authors sample a source frame sA of identity A and two driving frames dA, dR, where dA is of identity A and dR of a random identity. sA, dA, dR are used as training inputs. This gives two generated frames, g_dA and g_dR, which should both be of identity A.
For the pair (dA, g_dA), the photometric L1 loss + the identity loss are used; for the pair (dA, g_dR), only the identity loss is used (a sketch follows below).
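A schematic sketch of this loss combination (x2face, id_net, and the loss weight are placeholders, and the exact feature distance is my assumption):

```python
import torch.nn.functional as F

def training_loss(x2face, id_net, sA, dA, dR, w_id=1.0):
    # sA: source frame of identity A; dA: driving frame of identity A; dR: random identity.
    g_dA = x2face(source=sA, driving=dA)
    g_dR = x2face(source=sA, driving=dR)
    photometric = F.l1_loss(g_dA, dA)                   # g_dA should reproduce dA
    id_loss_A = F.l1_loss(id_net(g_dA), id_net(dA))     # identity features should match identity A
    id_loss_R = F.l1_loss(id_net(g_dR), id_net(dA))     # identity must stay A even when driven by dR
    return photometric + w_id * (id_loss_A + id_loss_R)
```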
Inference step:
The pose and expression of the source face can be changed by feeding the model a driving frame with the face of any other person.
Additionally, the authors show that the transformation of the source face can be driven by a pose code (encoding face attributes such as pitch/yaw/roll angles, for which ground truth is provided in the dataset). They train a 1-layer neural network to convert the pose code into a driving vector (the bottleneck features of the driving net) and use the driving net to generate the corresponding vector field from this driving vector.
Analogously, the authors train a mapping from a 0.2 s sound excerpt to the driving vector; the resulting driving vector can be used to transform the source face in the same way as described before.
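A tiny sketch of the pose-code branch (the bottleneck dimension and module names are assumptions):

```python
import torch
import torch.nn as nn

driving_vector_dim = 128                              # assumed size of the driving net's bottleneck
pose_to_driving = nn.Linear(3, driving_vector_dim)    # 1-layer net: (pitch, yaw, roll) -> driving vector

pose_code = torch.tensor([[0.1, -0.3, 0.0]])          # example (pitch, yaw, roll)
driving_vector = pose_to_driving(pose_code)           # the driving net then decodes this into a vector field
```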
Experiments:
The model is trained on the VoxCeleb dataset on cropped faces of size 256x256 px.
The model is compared to CycleGAN and Averbuch-Elor et al. [1]. The proposed method clearly produces better results than CycleGAN. Compared to [1], X2Face makes fewer assumptions about the input data (e.g., the source face does not need to be in a frontal pose with a neutral expression) and can handle larger pose changes. In contrast to [1], X2Face can work with a single driving frame and does not require a driving video.
Criticism:
- The resolution of the generated face is only 256x256.
- Generated frames appear blurred (not sharp) due to missing fine details on the face. Adding a discriminator might help to get more realistic renderings.
Conclusion:
I talked to Olivia (the first author) and overall it is a very nice piece of work. They propose an interesting self-supervised method, which produces vector fields to transform a source face into a generated face with a new pose and expression but the same appearance.
Links:
[1] Bringing Portraits to Life, SIGGRAPH Asia 2017, http://cs.tau.ac.il/~averbuch1/portraitslife/elor2017_bringingPortraits.pdf
Transferring Dense Pose to Proximal Animal Classes
Artsiom Sanakoyeu, Vasil Khalidov, Maureen S. McCarthy, Andrea Vedaldi, Natalia Neverova (Facebook AI Research)
CVPR 2020.
Project page: https://asanakoy.github.io/densepose-evolution/
Video: youtu.be/OU3Ayg_l4QM
Paper: https://arxiv.org/pdf/2003.00080.pdf
What?
The DensePose approach predicts the pose of humans densely and accurately, given a large dataset of poses annotated in detail.
We want to extend the same approach to animals, but without annotations, because it is very expensive to collect DensePose annotations for all the different classes of animals. We show that, at least for proximal animal classes such as chimpanzees, it is possible to transfer the knowledge existing in DensePose for humans. We propose to utilize the existing human annotations and do self-training on unlabeled images of animals.
In a nutshell, we first pretrain DensePose on the existing human annotations. Then we predict DensePose on unlabeled images, select the most confident predictions, and add them to an augmented training set for retraining the model. To be able to select point-wise the most confident DensePose predictions, we introduce a novel Auto-Calibrated version of DensePose-RCNN which can estimate the uncertainty of its predictions for every pixel.
We tested several techniques for sampling pseudo-labels and concluded that sampling based on confidence estimates from fine-grained tasks (24-Body-part estimation and DensePose UV-maps) results in the best performance.
We introduced a novel DensePose-Chimps dataset with Dense Pose ground truth annotations for chimps and tested our models on it, obtaining significant performance improvement over the baseline.
In this paper, we conducted thorough experiments only for chimps, but the method can be extended to other animals like cats and dogs as well.
More details:
1. To transfer DensePose from humans to animals we need a reference 3D model of the animal. Suppose we have an artist-created 3D model of the desired animal. The next step is to establish a dense mapping between the 3D model of the animal and the 3D model of a human. This is necessary to unify the evaluation protocols between humans and animals and allows the transfer of knowledge and annotations between different species. The matching between the 3D models is done by matching semantic descriptors of the vertices on the meshes.
2. Our goal is to develop a DensePose predictor for a new class. Such a predictor must detect the object via a bounding box, segment it from the background, and obtain the Dense-Pose chart and UV-map coordinates for each foreground pixel. To do this we introduce a multi-head R-CNN architecture that combines multiple recognition tasks within a single model.
The first head refines the coordinates of the bounding box. The second head computes a foreground-background segmentation mask in the same way as Mask R-CNN. The third and final head computes a part segmentation mask I, assigning each pixel to one of the 24 body-part charts, and the UV-map values for each foreground pixel. A schematic sketch of such a multi-head design is given below.
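A schematic sketch of a three-head design on top of pooled ROI features (all layer sizes and names are my assumptions; the real model follows the Mask R-CNN / DensePose-RCNN head designs):

```python
import torch
import torch.nn as nn

class DensePoseStyleHeads(nn.Module):
    def __init__(self, in_ch=256, num_parts=24):
        super().__init__()
        self.box_head = nn.Sequential(nn.Flatten(), nn.Linear(in_ch * 14 * 14, 1024),
                                      nn.ReLU(), nn.Linear(1024, 4))        # box refinement (dx, dy, dw, dh)
        self.mask_head = nn.Sequential(nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
                                       nn.Conv2d(256, 1, 1))                # foreground/background mask
        self.part_head = nn.Sequential(nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
                                       nn.Conv2d(256, num_parts + 1, 1))    # I: 24 parts + background
        self.uv_head = nn.Sequential(nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(256, 2 * num_parts, 1))      # U and V per part

    def forward(self, roi_feats):                  # roi_feats: (R, in_ch, 14, 14) pooled ROI features
        return (self.box_head(roi_feats), self.mask_head(roi_feats),
                self.part_head(roi_feats), self.uv_head(roi_feats))
```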
3. The COCO dataset already contains instance-segmentation and detection annotations for a few animal classes. Let's use them! Given a target animal class, say chimps, we want to find an optimal support domain: the set of COCO classes, pretraining on which gives the best detection (or segmentation) performance on a holdout set of chimps.
4. We jointly train DensePose prediction for people together with detection and segmentation for the other classes in the support domain. The goal is always to build a model only for the final target class; we found that merging classes is an effective way of integrating information, so all support-domain categories are merged into one and the training is done in a class-agnostic manner.
5. Now we have our baseline network, which knows a lot about humans and a bit about the detection and segmentation of animals. We run this model over ~5 TB of videos from camera traps in the wild and select around 100k video frames with good detections.
Now we aim to utilize the DensePose pseudo-labels obtained on these unlabeled frames for retraining the network.
6. To be able to select good point-wise predictions, the model has to estimate its uncertainty for every pixel and for every task we are solving. We introduce a novel Auto-Calibrated version of DensePose-RCNN which can estimate the uncertainty of its predictions for every pixel and every task. We propose to model (a) classification uncertainty (for object classification and segmentation) using temperature scaling in the softmax layer, and (b) regression uncertainty (for bounding-box proposals and DensePose UV maps) by predicting a Gaussian distribution instead of a single target value. The higher the predicted variance, the higher the uncertainty. A minimal sketch of both is given below.
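A minimal sketch of the two uncertainty mechanisms (variable names and the exact parameterization are my assumptions):

```python
import torch
import torch.nn.functional as F

def calibrated_part_probs(part_logits, temperature):
    # (a) classification uncertainty: temperature-scaled softmax over body-part logits.
    # part_logits: (N, 25, H, W) = background + 24 body-part charts.
    return F.softmax(part_logits / temperature, dim=1)

def gaussian_nll(pred_uv, pred_log_var, target_uv):
    # (b) regression uncertainty: the head predicts a mean and a (log-)variance per pixel;
    # training minimizes the Gaussian negative log-likelihood, so uncertain pixels can
    # inflate their variance instead of being forced to fit the target exactly.
    return 0.5 * (pred_log_var + (target_uv - pred_uv) ** 2 / pred_log_var.exp()).mean()
```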
7. Given the pixel-wise uncertainties, we can sample for the second round of training only those foreground points from the selected frames which have the highest confidence (see the sketch below). We experimented with different sampling strategies and show that sampling based on the confidences from the fine-grained tasks (24-way body-part segmentation, UV maps) results in the best performance.
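And a sketch of how such per-pixel confidences could be turned into a pseudo-label selection mask (thresholds and tensor layouts are my assumptions):

```python
import torch

def confident_pixel_mask(part_probs, uv_log_var, p_min=0.9, var_max=0.05):
    # part_probs: (25, H, W) calibrated probabilities (background + 24 parts);
    # uv_log_var: (2, H, W) predicted log-variance of the UV regression.
    part_conf = part_probs.max(dim=0).values               # confidence of the winning part
    uv_conf_ok = uv_log_var.exp().mean(dim=0) < var_max    # low predicted variance = confident
    return (part_conf > p_min) & uv_conf_ok                # keep only these pixels as pseudo-labels
```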
8. The network retrained on the augmented data (existing human annotations + pseudo-labeled animals) shows a significant performance boost on the held-out, manually annotated DensePose-Chimps dataset.
The video demonstration of the self-trained model: youtu.be/OU3Ayg_l4QM
9. We also show that the proposed Auto-Calibrated RCNN improves over the baseline even without self-training on standard DensePose-Human, detection and segmentation tasks. This is due to the higher robustness of the proposed model to unseen data distributions at test time.
Conclusion:
- Studied the problem of extending dense body pose recognition to animal species and suggested that doing this at scale requires learning from unlabelled data;
- demonstrated that existing detection, segmentation, and dense pose labeling models can transfer very well to a proximal animal class such as chimpanzee despite significant inter-class differences;
- introduced Auto-Calibrated DensePose-RCNN which can estimate the uncertainty of its predictions;
- introduced the novel DensePose-Chimps dataset for benchmarking dense pose prediction for chimpanzees;
- showed that substantial improvements can be obtained by carefully selecting which categories to use to pre-train the model, by using a class-agnostic architecture to integrate different sources of information;
- and by modeling labeling uncertainty to grade pseudo-labels for self-training;
- achieved excellent performance without using a single labeled image of the target class for training.
Guys, I did some rebranding and also opened an Instagram channel where I will post more high-level information about new papers and research.
https://www.instagram.com/gradientdude/ subscribe!