High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs
CVPR 2018
https://arxiv.org/abs/1711.11585
What?
Synthesize high-resolution (2048x1024) photo-realistic images from semantic label maps using GANs. Applied to street views, indoor scenes, and faces.
Main points:
- A ResNet-based architecture is used for the generator.
- Multi-resolution pipeline: train 2 generators. The first generator G_1 produced 1024x512 image. The second generator G_2 produces 2048x1024 image, but the output of the last feature layer of G_1 is element-wise summed with the output of one of the intermediate layers of G_2.
After training of G_1, they fix it and train G_2. This helps to integrate the global information from G_1 to G_2.
After G_2 is trained they jointly fine-tune all the networks together.
- Multi-scale discriminators. They use 3 discriminators with identical architectures but unshared weights.
Each discriminator operates at a different image scale: the first gets the original image, the second and the third get images downsampled by factors of 2 and 4, respectively (see the sketch after this list).
- LSGAN (Mao et al., 2017) objective function.
- Feature matching loss based on features extracted from the layers of the 3 discriminators (in the same spirit as the perceptual loss of Johnson et al., 2016).
- VGG feature loss (Johnson et al., 2016).
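A minimal PyTorch sketch of the multi-scale discriminator setup with the LSGAN objective (the architecture, function names, and loss weighting here are my own simplifications; the actual pix2pixHD discriminators are deeper and also expose intermediate features for the feature-matching loss):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def make_patch_discriminator(in_ch):
    # Simplified PatchGAN-style discriminator; the real pix2pixHD one is deeper
    # and also returns intermediate features for the feature-matching loss.
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.2),
        nn.Conv2d(128, 1, 4, stride=1, padding=1),   # per-patch real/fake score map
    )

discriminators = nn.ModuleList([make_patch_discriminator(3) for _ in range(3)])

def multiscale_outputs(img):
    # Run the 3 discriminators on an image pyramid: full, /2, and /4 resolution.
    outs = []
    for i, D in enumerate(discriminators):
        outs.append(D(img))
        if i < len(discriminators) - 1:
            img = F.avg_pool2d(img, kernel_size=3, stride=2, padding=1)
    return outs

def lsgan_d_loss(real_img, fake_img):
    # LSGAN objective: push real patches toward 1 and fake patches toward 0 (least squares).
    loss = 0.0
    for out_r, out_f in zip(multiscale_outputs(real_img), multiscale_outputs(fake_img.detach())):
        loss = loss + F.mse_loss(out_r, torch.ones_like(out_r)) \
                    + F.mse_loss(out_f, torch.zeros_like(out_f))
    return loss
```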
Extra trick:
They use not just semantic label maps, but instance-level semantic label maps, which contain a unique object ID for each individual object.
- Train another encoder-decoder network to reconstruct images. Compute encoder features for every instance and use instance-wise average pooling to compute the average feature for that object instance. The average feature is then broadcast to all pixel locations of the instance (see the sketch after this list). Denote by E(x) the average feature map produced in this way for input image x.
- When training the generator (G_1 or G_2), use not only the semantic label map as input but also E(x), concatenated to it as extra channels. The generator and E are trained jointly.
- After training, extract E(x) features for all instances in the training images and record them. Perform K-means clustering on these features for each semantic category. Each cluster thus encodes the features of a specific style, for example the asphalt or cobblestone texture of a road.
- At inference time, randomly pick one of the cluster centers and use it as the encoded features. These features are concatenated with the label map and used as the input to the generator G.
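A minimal sketch of the instance-wise average pooling that produces E(x) (tensor shapes and names are my assumptions):

```python
import torch

def instance_average_pool(feat, inst_map):
    # feat: (C, H, W) encoder features; inst_map: (H, W) integer instance IDs.
    out = torch.zeros_like(feat)
    for inst_id in inst_map.unique():
        mask = inst_map == inst_id                              # pixels of this instance
        out[:, mask] = feat[:, mask].mean(dim=1, keepdim=True)  # broadcast the mean feature
    return out  # E(x): same spatial size as feat, piecewise-constant per instance
```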
Experiments:
Compared to pix2pix (Isola et al., 2017) and CRN (Chen and Koltun, 2017); the proposed method shows better results.
Good ablation studies.
Criticism:
It is not clear whether the feature loss based on the discriminators' features gives any improvement.
Take-home message:
A multi-resolution generator pipeline + multi-scale discriminators work well.
The trick with instance-level semantic label maps enables interactive image editing and helps capture different modes during training.
Learning Linear Transformations for Fast Arbitrary Style Transfer
https://arxiv.org/abs/1808.04537
Not published but looks like CVPR submission, 2018.
What?
The paper is a follow-up to WCT ("Universal Style Transfer via Feature Transforms", Li et al., NIPS 2017).
The basic idea is to transfer second-order statistics from a style image onto a content image via a multiplication between content image features and a transformation matrix. In the WCT paper, the authors compute the transformation matrix with a pre-determined algorithm (whitening and coloring transforms). In this paper, they propose to use a neural network to produce the desired transformation matrix from a pair of style and content images.
Short algorithm explanation:
1. Train a VGG-based autoencoder on a dataset of content images.
2. Add a transformation module in the bottleneck, which takes content image features and style image features and produces a transformation matrix.
3. Apply the transformation to the content image features, feed the transformed features to the decoder, and get a stylization.
Steps 2 and 3 are trained with the widely used Gram-matrix losses on pretrained VGG features (Gatys et al., 2015), as sketched below.
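A rough sketch of the core linear transformation and the Gram-matrix loss (shapes and names are my assumptions; the module that actually predicts T from the content/style feature pair is a small network whose details I omit):

```python
import torch
import torch.nn.functional as F

def apply_transform(content_feat, style_feat, T):
    # content_feat: (C, H, W), style_feat: (C, Hs, Ws), T: (C, C) produced by the module.
    c = content_feat.flatten(1)
    c = c - c.mean(dim=1, keepdim=True)                  # center the content features
    s_mean = style_feat.flatten(1).mean(dim=1, keepdim=True)
    out = T @ c + s_mean                                 # transfer 2nd-order stats, add the style mean
    return out.view_as(content_feat)                     # fed to the decoder

def gram(feat):                                          # (C, H, W) -> (C, C)
    f = feat.flatten(1)
    return f @ f.t() / f.shape[1]

def style_loss(vgg_feats_out, vgg_feats_style):
    # Gram-matrix loss over several VGG layers (Gatys et al., 2015).
    return sum(F.mse_loss(gram(a), gram(b)) for a, b in zip(vgg_feats_out, vgg_feats_style))
```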
What is the claimed benefit?
1. Using several different layers of a VGG network to compute the Gram-matrix loss is the commonly adopted technique in style transfer.
With WCT, one would have to do k forward passes through the stylization network to use k different VGG layers.
In contrast, the proposed method learns to model the statistics of those k layers in a single transformation matrix, which is more efficient.
2. After training, a stylization can be produced for any previously unseen style image in a single forward pass.
3. The learned transformation allows using a shallower encoder.
4. It is fast: 40 FPS on 512x512 images with the shallow encoder and 28 FPS with a deeper one.
5. More stable frame-wise video stylization, without any temporal context.
6. According to the provided figures, stylization quality is improved compared to WCT and AdaIN (Huang et al., ICCV 2017).
Experiments:
- Standard style transfer experiments, but only a few methods are compared against.
- Video stylization.
- Photo-realistic stylization (like day to night).
- Game to real (GTA images to photos).
Criticism:
- Weak comparisons to existing methods. I have a strong impression that the images are cherry-picked to show the cases where improvement is visible.
- It is doubtful that the method can be applied to images larger than 512x512 px.
Take home:
Matching second-order statistics between a content image and a style image can be modeled by a linear transformation whose matrix is generated by a learned module. This gives a fast and generalizable approach to style transfer.
Links:
[1] WCT: https://arxiv.org/abs/1705.08086
[2] AdaIN: https://arxiv.org/abs/1703.06868
[3] Gatys et al., 2015: https://arxiv.org/abs/1508.06576
Everybody Dance Now
https://arxiv.org/abs/1808.07371
Arxiv, 22 Aug 2018 (perhaps submitted to SIGGRAPH)
What?
Given a video of a source person and another of a target person, the method can generate a new video of the target person enacting the same motions as the source. This is achieved by means of a Pix2PixHD model + pose estimation + a temporal coherence loss + an extra generator for faces.
Pix2PixHD[1] is "High-Resolution Image Synthesis and Semantic Manipulation with Conditional GANs", which I described 2 posts earlier.
Method:
Three-stage approach: pose detection, pose normalization, and mapping from normalized pose stick figures to the target subject.
1. Pose estimation: apply a pretrained pose estimation model (OpenPose[2]) to every frame of the input and output videos, and draw the pose for every frame as a stickman on a white background. So, for every frame y we have a corresponding stickman image x.
2. Train a Pix2PixHD generator G to generate a target-person image G(x) given a stickman x as input. The discriminator D attempts to distinguish between 'real' image pairs (x, y) and 'fake' pairs (x, G(x)).
3. The vanilla Pix2PixHD model works on single frames, but we want temporal coherence between consecutive frames. The authors propose to generate the t-th frame G(x_t) using the corresponding stickman image x_t and the previously generated frame G(x_{t-1}). The discriminator then tries to discern a 'fake' sequence (x_{t-1}, x_t, G(x_{t-1}), G(x_t)) from a 'real' sequence (x_{t-1}, x_t, y_{t-1}, y_t); a minimal sketch of this setup follows the list.
4. To improve the quality of human faces, the authors add a specialized GAN designed to add more detail to the face region. It generates a cropped-out face given the cropped-out head region of the stickman. After training the full image generator G, they feed a cropped-out face and the corresponding region of the stickman to the face generator G_f, which outputs a residual. This residual is then added to the previously generated full image to improve face realism.
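A minimal sketch of the temporally conditioned setup from step 3 (tensor names and the concatenation layout are my assumptions):

```python
import torch

# x_prev, x_t: stickman images; y_prev, y_t: real frames; all (N, 3, H, W).
def temporal_step(G, D, x_prev, x_t, y_prev, y_t):
    zero_frame = torch.zeros_like(y_t)                        # placeholder "previous frame" at t = 0
    g_prev = G(torch.cat([x_prev, zero_frame], dim=1))        # generate frame t-1
    g_t    = G(torch.cat([x_t, g_prev], dim=1))               # condition on the generated t-1 frame
    d_fake = D(torch.cat([x_prev, x_t, g_prev, g_t], dim=1))  # 'fake' sequence
    d_real = D(torch.cat([x_prev, x_t, y_prev, y_t], dim=1))  # 'real' sequence
    return d_real, d_fake
```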
Training is done in two stages:
1. Train the image generator G and the discriminator D, then freeze their weights.
2. Train the face generator G_f along with the face discriminator D_f.
Pose transfer from the source video to the target person:
1. Source stickmen are normalized to match the position and scale of the target person's poses (a toy version of this step is sketched below).
2. Frame by frame, feed the normalized source stickman images to the generators G and G_f to get the target person performing the same movements as the source.
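A toy version of the normalization in step 1, assuming 2D keypoints are available for both subjects (the paper's actual scheme is more careful; the statistics used here are hypothetical):

```python
import numpy as np

def normalize_pose(src_kpts, src_span, tgt_span):
    # src_kpts: (J, 2) source keypoints as (x, y); *_span: (y_top, y_bottom) giving the
    # typical vertical extent of the person in the source / target videos (hypothetical stats).
    scale = (tgt_span[1] - tgt_span[0]) / (src_span[1] - src_span[0])
    out = src_kpts.astype(np.float64).copy()
    out[:, 1] = (out[:, 1] - src_span[0]) * scale + tgt_span[0]   # match height and vertical position
    x_center = out[:, 0].mean()
    out[:, 0] = (out[:, 0] - x_center) * scale + x_center         # scale width about the body center
    return out
```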
Experiments:
The authors test their method on dancing videos collected from the internet as sources and on their own videos as targets.
Discussion:
Overall, the method shows compelling results of a target person dancing in the same way as another person.
But it is not perfect. Self-occlusions of the person are not rendered properly (for example, limbs can disappear).
Target persons were deliberately filmed in tight clothes with minimal wrinkling since the pose representation does not encode information about clothes. So it may not work on people wearing arbitrary apparel. Another problem pointed out by the authors is video jittering when the input motion or motion speed is different from the movements seen at training time.
Links:
[1] https://arxiv.org/pdf/1711.11585.pdf
[2] https://github.com/CMU-Perceptual-Computing-Lab/openpose
I'm attending ECCV right now and must share this with you.
Great talk from Erik Learned-Miller on unsupervised learning using depth prediction as a surrogate task. The paper: https://t.co/BeaSLoQAPh
Improving a detector's performance by unsupervised hard example mining.
Paper: https://t.co/3P4NX2dds5
X2Face: A network for controlling face generation using images, audio, and pose codes
Olivia Wiles*, A. Sophia Koepke*, Andrew Zisserman
https://arxiv.org/abs/1807.10550
ECCV 2018
Briefly
The authors propose a model that can control a source face given a driving face, producing a generated frame with the same identity as the source face but the pose and expression of the driving face. The model is trained in a self-supervised manner on a large collection of video data. They also show that the generation process can be driven by audio (a person speaking) or by pose codes without further network training. No 3D models are used; the method works on 2D frames.
Method:
The model consists of 2 networks: an embedding net and a driving net.
1. Given a source frame, the embedding net predicts a vector field which, applied to the source frame, gives a so-called embedded face (which is essentially a frontalized face).
2. Given a driving frame, the driving net predicts a vector field (i.e., an x,y shift for every pixel of the input image). This vector field transforms pixels of the embedded face to produce the generated frame.
The driving net has an encoder-decoder architecture (U-Net and pix2pix based). The latent space of this network (the bottleneck features, called the driving vector) encodes pose, expression, zoom, and other factors of variation.
Note: the trick is to avoid translating the input image into the output image directly at the pixel level (as, for example, CycleGAN does) and instead to predict a vector field that transforms the input image into something new. This allows training without a discriminator and without explicit pose/expression labels.
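A minimal sketch of how a predicted per-pixel vector field can warp an image via bilinear sampling (my own implementation; X2Face's exact sampling convention may differ):

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    # image: (N, C, H, W); flow: (N, 2, H, W) predicted (x, y) pixel offsets.
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys]).float().to(image.device)   # (2, H, W) pixel coordinates
    coords = base.unsqueeze(0) + flow                        # where each output pixel samples from
    coords_x = 2 * coords[:, 0] / (w - 1) - 1                # normalize to [-1, 1] for grid_sample
    coords_y = 2 * coords[:, 1] / (h - 1) - 1
    grid = torch.stack([coords_x, coords_y], dim=-1)         # (N, H, W, 2)
    return F.grid_sample(image, grid, align_corners=True)
```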
Training is done in two steps:
First step:
The authors sample two random frames from the same video (same person). One is used as the source frame, the other as the driving frame. In this case, the loss function is a photometric L1 loss between the driving frame and the generated frame (since the source frame has the same identity as the driving frame).
Second step:
A new loss function is introduced: an identity loss, which compares frames using features of a pre-trained person-identification network (VGG-11).
The authors sample a source frame sA of identity A and two driving frames dA, dR, where dA is of identity A and dR of a random identity. sA, dA, dR are used as training inputs. This gives two generated frames, g_dA and g_dR, which should both be of identity A.
For the pair (dA, g_dA), the photometric L1 loss + the identity loss are used; for the pair (dA, g_dR), only the identity loss is used (a sketch follows below).
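A schematic sketch of this loss combination (x2face, id_net, and the loss weight are placeholders, and the exact feature distance is my assumption):

```python
import torch.nn.functional as F

def training_loss(x2face, id_net, sA, dA, dR, w_id=1.0):
    # sA: source frame of identity A; dA: driving frame of identity A; dR: random identity.
    g_dA = x2face(source=sA, driving=dA)
    g_dR = x2face(source=sA, driving=dR)
    photometric = F.l1_loss(g_dA, dA)                   # g_dA should reproduce dA
    id_loss_A = F.l1_loss(id_net(g_dA), id_net(dA))     # identity features should match identity A
    id_loss_R = F.l1_loss(id_net(g_dR), id_net(dA))     # identity must stay A even when driven by dR
    return photometric + w_id * (id_loss_A + id_loss_R)
```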
Inference step:
The pose and expression of the source face can be changed by feeding the model a driving frame with the face of any other person.
Additionally, the authors show that the transformation of the source face can be driven by a pose code (encoding face attributes such as pitch/yaw/roll angles, for which ground truth is provided in the dataset). They train a 1-layer neural network to convert the pose code into a driving vector (the bottleneck features of the driving net) and use the driving net to generate the corresponding vector field from this driving vector.
Analogously, the authors train a mapping from a 0.2 s sound excerpt to the driving vector; the resulting driving vector can be used to transform the source face in the same way as described before.
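A tiny sketch of the pose-code branch (the bottleneck dimension and module names are assumptions):

```python
import torch
import torch.nn as nn

driving_vector_dim = 128                              # assumed size of the driving net's bottleneck
pose_to_driving = nn.Linear(3, driving_vector_dim)    # 1-layer net: (pitch, yaw, roll) -> driving vector

pose_code = torch.tensor([[0.1, -0.3, 0.0]])          # example (pitch, yaw, roll)
driving_vector = pose_to_driving(pose_code)           # the driving net then decodes this into a vector field
```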
Experiments:
The model is trained on the VoxCeleb dataset on cropped faces of size 256x256 px.
The model is compared to CycleGAN and Averbuch-Elor et al. [1]. The proposed method clearly produces better results than CycleGAN. Compared to [1], X2Face makes fewer assumptions about the input data (e.g., the source face does not need to be in a frontal pose with a neutral expression) and can handle larger pose changes. In contrast to [1], X2Face can work with a single driving frame and does not require a driving video.
Criticism:
- The resolution of the generated face is only 256x256.
- Generated frames appear blurred (not sharp) due to missing fine details on the face. Adding a discriminator might help to get more realistic renderings.
Conclusion:
I talked to Olivia (the first author) and overall it is a very nice piece of work. They propose an interesting self-supervised method, which produces vector fields to transform a source face into a generated face with a new pose and expression but the same appearance.
Links:
[1] Bringing Portraits to Life, SIGGRAPH Asia 2017, http://cs.tau.ac.il/~averbuch1/portraitslife/elor2017_bringingPortraits.pdf
Transferring Dense Pose to Proximal Animal Classes
Artsiom Sanakoyeu, Vasil Khalidov, Maureen S. McCarthy, Andrea Vedaldi, Natalia Neverova (Facebook AI Research)
CVPR 2020.
Project page: https://asanakoy.github.io/densepose-evolution/
Video: youtu.be/OU3Ayg_l4QM
Paper: https://arxiv.org/pdf/2003.00080.pdf
What?
The DensePose approach predicts the pose of humans densely and accurately, given a large dataset of poses annotated in detail.
We want to extend the same approach to animals, but without annotations, because it is very expensive to collect DensePose annotations for all the different classes of animals. We show that, at least for proximal animal classes such as chimpanzees, it is possible to transfer the knowledge existing in DensePose for humans. We propose to utilize the existing human annotations and do self-training on unlabeled images of animals.
In a nutshell, we first pretrain DensePose on the existing human annotations. Then we predict DensePose on unlabeled images, select the most confident predictions, and add them to an augmented training set for retraining the model. To be able to select point-wise the most confident DensePose predictions, we introduce a novel Auto-Calibrated version of DensePose-RCNN which can estimate the uncertainty of its predictions for every pixel.
We tested several techniques for sampling pseudo-labels and concluded that sampling based on confidence estimates from fine-grained tasks (24-Body-part estimation and DensePose UV-maps) results in the best performance.
We introduced a novel DensePose-Chimps dataset with Dense Pose ground truth annotations for chimps and tested our models on it, obtaining significant performance improvement over the baseline.
In this paper, we conducted thorough experiments only for chimps, but the method can be extended to other animals like cats and dogs as well.
More details:
1. To transfer DensePose from humans to animals we need a reference 3D model of the animal. Suppose we have an artist-created 3D model of the desired animal. The next step is to establish a dense mapping between the 3D model of the animal and the 3D model of a human. This is necessary to unify the evaluation protocols between humans and animals and allows the transfer of knowledge and annotations between different species. The matching between the 3D models is done by matching semantic descriptors of the vertices on the meshes.
2. Our goal is to develop a DensePose predictor for a new class. Such a predictor must detect the object via a bounding box, segment it from the background, and obtain the Dense-Pose chart and UV-map coordinates for each foreground pixel. To do this we introduce a multi-head R-CNN architecture that combines multiple recognition tasks within a single model.
The first head refines the coordinates of the bounding box. The second head computes a foreground-background segmentation mask in the same way as Mask R-CNN. The third and final head computes a part segmentation mask I, assigning each pixel to one of the 24 body-part charts, and the UV-map values for each foreground pixel. A schematic sketch of such a multi-head design is given below.
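A schematic sketch of a three-head design on top of pooled ROI features (all layer sizes and names are my assumptions; the real model follows the Mask R-CNN / DensePose-RCNN head designs):

```python
import torch
import torch.nn as nn

class DensePoseStyleHeads(nn.Module):
    def __init__(self, in_ch=256, num_parts=24):
        super().__init__()
        self.box_head = nn.Sequential(nn.Flatten(), nn.Linear(in_ch * 14 * 14, 1024),
                                      nn.ReLU(), nn.Linear(1024, 4))        # box refinement (dx, dy, dw, dh)
        self.mask_head = nn.Sequential(nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
                                       nn.Conv2d(256, 1, 1))                # foreground/background mask
        self.part_head = nn.Sequential(nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
                                       nn.Conv2d(256, num_parts + 1, 1))    # I: 24 parts + background
        self.uv_head = nn.Sequential(nn.Conv2d(in_ch, 256, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(256, 2 * num_parts, 1))      # U and V per part

    def forward(self, roi_feats):                  # roi_feats: (R, in_ch, 14, 14) pooled ROI features
        return (self.box_head(roi_feats), self.mask_head(roi_feats),
                self.part_head(roi_feats), self.uv_head(roi_feats))
```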
3. The COCO dataset already contains instance-segmentation and detection annotations for a few animal classes. Let's use them! Given a target animal class, say chimps, we want to find an optimal support domain: the set of COCO classes, pretraining on which gives the best detection (or segmentation) performance on a holdout set of chimps.
4. We jointly train DensePose prediction for people together with detection and segmentation for the other classes in the support domain. The goal is always to build a model only for the final target class; we found that merging classes is an effective way of integrating information, so all support-domain categories are merged into one and the training is done in a class-agnostic manner.
5. Now we have our baseline network, which knows a lot about humans and a bit about the detection and segmentation of animals. We run this model over ~5 TB of videos from camera traps in the wild and select around 100k video frames with good detections.
Now we aim to utilize the DensePose pseudo-labels obtained on these unlabeled frames for retraining the network.
6. To be able to select good point-wise predictions, the model has to estimate its uncertainty for every pixel and for every task we are solving. We introduce a novel Auto-Calibrated version of DensePose-RCNN which can estimate the uncertainty of its predictions for every pixel and every task. We propose to model (a) classification uncertainty (for object classification and segmentation) using temperature scaling in the softmax layer, and (b) regression uncertainty (for bounding-box proposals and DensePose UV maps) by predicting a Gaussian distribution instead of a single target value. The higher the predicted variance, the higher the uncertainty. A minimal sketch of both is given below.
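A minimal sketch of the two uncertainty mechanisms (variable names and the exact parameterization are my assumptions):

```python
import torch
import torch.nn.functional as F

def calibrated_part_probs(part_logits, temperature):
    # (a) classification uncertainty: temperature-scaled softmax over body-part logits.
    # part_logits: (N, 25, H, W) = background + 24 body-part charts.
    return F.softmax(part_logits / temperature, dim=1)

def gaussian_nll(pred_uv, pred_log_var, target_uv):
    # (b) regression uncertainty: the head predicts a mean and a (log-)variance per pixel;
    # training minimizes the Gaussian negative log-likelihood, so uncertain pixels can
    # inflate their variance instead of being forced to fit the target exactly.
    return 0.5 * (pred_log_var + (target_uv - pred_uv) ** 2 / pred_log_var.exp()).mean()
```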
7. Given the pixel-wise uncertainties, we can sample for the second round of training only those foreground points from the selected frames which have the highest confidence (see the sketch below). We experimented with different sampling strategies and show that sampling based on the confidences from the fine-grained tasks (24-way body-part segmentation, UV maps) results in the best performance.
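And a sketch of how such per-pixel confidences could be turned into a pseudo-label selection mask (thresholds and tensor layouts are my assumptions):

```python
import torch

def confident_pixel_mask(part_probs, uv_log_var, p_min=0.9, var_max=0.05):
    # part_probs: (25, H, W) calibrated probabilities (background + 24 parts);
    # uv_log_var: (2, H, W) predicted log-variance of the UV regression.
    part_conf = part_probs.max(dim=0).values               # confidence of the winning part
    uv_conf_ok = uv_log_var.exp().mean(dim=0) < var_max    # low predicted variance = confident
    return (part_conf > p_min) & uv_conf_ok                # keep only these pixels as pseudo-labels
```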
8. The network retrained on the augmented data (existing human annotations + pseudo-labeled animals) shows a significant performance boost on the held-out, manually annotated DensePose-Chimps dataset.
The video demonstration of the self-trained model: youtu.be/OU3Ayg_l4QM
9. We also show that the proposed Auto-Calibrated RCNN improves over the baseline even without self-training on standard DensePose-Human, detection and segmentation tasks. This is due to the higher robustness of the proposed model to unseen data distributions at test time.
Conclusion:
- Studied the problem of extending dense body pose recognition to animal species and suggested that doing this at scale requires learning from unlabelled data;
- demonstrated that existing detection, segmentation, and dense pose labeling models can transfer very well to a proximal animal class such as chimpanzee despite significant inter-class differences;
- introduced Auto-Calibrated DensePose-RCNN which can estimate the uncertainty of its predictions;
- introduced the novel DensePose-Chimps dataset for benchmarking dense pose prediction for chimpanzees;
- showed that substantial improvements can be obtained by carefully selecting which categories to use to pre-train the model, by using a class-agnostic architecture to integrate different sources of information;
- and by modeling labeling uncertainty to grade pseudo-labels for self-training;
- achieved excellent performance without using a single labeled image of the target class for training.
Guys, I did some rebranding and also opened an Instagram channel where I will post more high-level information about new papers and research.
https://www.instagram.com/gradientdude/ subscribe!