Learning to resize: replace the front-end resizer in deep networks with a learnable non-linear resizer
Google Research
Deep computer vision models can benefit greatly from replacing the fixed linear resizer used to downsample ImageNet images before training with a well-designed, learned, nonlinear resizer.
The structure of the learned resizer is specific: it is not just more generic convolutional layers added to the baseline model. It looks like it strives to encode extra information into the downsampled image, and that is where the extra performance on ImageNet comes from.
This work shows that a generically deeper model can still be improved with a well-designed, task-optimized front-end processor.
Looking ahead: there is probably a lot of room for work on task-optimized pre-processing modules for computer vision and other tasks.
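As a rough illustration of the idea, here is a minimal NumPy sketch of such a front-end: a fixed bilinear resizer plus a nonlinear learned residual branch. The single `conv_weight` matrix and the ReLU stand in for the paper's trained residual blocks; they are illustrative assumptions, not the paper's architecture.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Fixed linear resizer: plain bilinear resampling, (H, W, C) -> (out_h, out_w, C)."""
    h, w, c = img.shape
    ys = np.linspace(0, h - 1, out_h)
    xs = np.linspace(0, w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, w - 1)
    wy = (ys - y0)[:, None, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :, None]   # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bot = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bot * wy

def learned_resizer(img, out_h, out_w, conv_weight):
    """Sketch of the paper's idea: a bilinear skip connection plus a learnable
    nonlinear residual branch (here a toy per-pixel linear map + ReLU; the real
    model uses residual conv blocks trained jointly with the classifier)."""
    base = bilinear_resize(img, out_h, out_w)        # fixed linear part (skip)
    residual = np.maximum(base @ conv_weight, 0.0)   # toy nonlinear learned part
    return base + residual
```

In the paper the whole front-end is trained end-to-end with the recognition model, so the residual branch learns whatever extra signal helps the downstream task.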
Paper
No code yet
#cv #paper_explained
New video on my YouTube channel!
I have created a detailed video explanation of the paper "NeX: Real-time View Synthesis with Neural Basis Expansion"
Task
Given a set of photos of the scene (10-60 photos), learn a 3D representation of the scene that allows rendering it from novel camera poses.
How?
The proposed approach uses a modification of the Multiplane Image (MPI): it models view-dependent effects by parameterizing each pixel as a linear combination of basis functions learned by a neural network. The pixel representation (i.e., the coordinates in the set of bases defined by the basis functions) depends on the pixel coordinates (x, y, z), but not on the viewing angle. In contrast, the basis functions depend only on the viewing angle and are the same for every pixel once the angle is fixed. This decoupling of angle and coordinates allows caching all pixel representations, which results in a 100x speedup of novel-view rendering (60 FPS!). Moreover, the proposed scene parametrization allows rendering specular (non-Lambertian) objects with complex view-dependent effects.

Detailed approach summary
A multiplane image is a 3D scene representation that consists of a collection of D planar images, each with dimension H × W × 4, where the last dimension contains RGB values and an alpha transparency value. These planes are scaled and placed equidistantly either in depth space (for bounded close-up objects) or in inverse depth space (for scenes that extend out to infinity) along a reference viewing frustum.

One main limitation of MPI is that it can only model diffuse or Lambertian surfaces, whose colors appear constant regardless of the viewing angle. In real-world scenes, many objects are non-Lambertian, such as a ceramic plate, a glass table, or a metal wrench.
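The planes are rendered with standard back-to-front "over" alpha compositing; a minimal NumPy sketch (assuming the planes have already been warped to the target view):

```python
import numpy as np

def composite_mpi(planes):
    """Back-to-front "over" compositing of an MPI.
    planes: (D, H, W, 4) array, planes[0] is the farthest plane;
    the last channel is alpha transparency."""
    rgb = np.zeros(planes.shape[1:3] + (3,))
    for d in range(planes.shape[0]):             # far -> near
        color = planes[d, ..., :3]
        alpha = planes[d, ..., 3:4]
        rgb = color * alpha + rgb * (1 - alpha)  # "over" operator
    return rgb
```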
Regressing the color directly from the viewing angle v (and the pixel location [x, y, z]) with a neural network F(x, y, z, v), as is done in NeRF, is very inefficient for real-time rendering, as it requires recomputing every voxel in the volume for every new camera pose.

The key idea of the NeX method is to approximate this function F(x, y, z, v) with a linear combination of learnable basis functions {H_n(v): R^2 → R^{3×3}}.

To summarize, the modified MPI contains the following parameters per pixel: α, k0, k1, ..., k_N. These parameters are predicted by a neural network f(x, y, z) for every pixel. Another set of parameters, the global basis matrices H1(v), H2(v), ..., H_N(v), is shared across all pixels but depends on the viewing angle v. The columns of H_n(v) are basis vectors of a color space different from RGB. These basis matrices are predicted by another neural network g(v) = [H1(v), H2(v), ..., H_N(v)].

The motivation for using the second network is to ensure that the prediction of the basis functions is independent of the voxel coordinates. This allows precomputing and caching the output of f(x, y, z) for all coordinates. A novel view can therefore be synthesized with just a single forward pass of the network g(v), because f() does not depend on v and does not need to be recomputed.

Compared with NeRF, the proposed MPI can be thought of as a discretized sampling of an implicit radiance field function, decoupled into view-dependent basis functions H_n(v) and view-independent parameters α and k_n, n = 1...N.

Video explanation
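Following the post's notation, the per-pixel color combination can be sketched in a few lines of NumPy. Treating each H_n(v) as a 3x3 matrix applied to the per-pixel coefficient vector k_n is an assumption of this sketch, not necessarily NeX's exact formulation:

```python
import numpy as np

def nex_pixel_color(k0, ks, Hs):
    """View-dependent color of one pixel in the modified MPI:
    C(x, y, z, v) = k0 + sum_n H_n(v) @ k_n
    k0: (3,)       base color, cached per pixel (output of f)
    ks: (N, 3)     per-pixel coefficients, cached (output of f)
    Hs: (N, 3, 3)  view-dependent basis matrices (output of g(v), shared by all pixels)"""
    return k0 + np.einsum('nij,nj->i', Hs, ks)
```

The point of the factorization shows up here: `k0` and `ks` are cached once per scene, so per frame only `g(v)` (i.e., `Hs`) needs a network evaluation.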
NeX project page
NeX paper
Realtime demo
Multiplane Images (MPI)
NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
#paper_explained #cv #video_exp
Some NeX implementation details:
- Fine details or high-frequency content tends to come from the surface texture itself and not necessarily from complex scene geometry. The authors found that simply storing the first coefficient k0, or "base color," explicitly helps ease the network's burden of compressing and reproducing detail and leads to sharper results, in fewer iterations.
- k0 is optimized explicitly as a learnable parameter with a total variation regularizer.
- Computing and storing all N + 1 coefficients (k0, k1, ..., kN) for all pixels across all D depth planes can be expensive for both training and rendering, so the authors use a coefficient-sharing scheme where every M consecutive planes share the same coefficients, but not the alphas.
- Total reconstruction loss = photometric loss + difference of the image gradients.
- For a scene with 17 input photos at resolution 1008x756, training takes ~18 hours on 1x NVIDIA V100 with a batch size of 1.
- To render one pixel, NeX uses 0.16 MFLOPs, whereas NeRF uses 226 MFLOPs.
- f() is an MLP with 6 FC layers, each with 384 hidden nodes.
- g() is an MLP with 3 FC layers, each with 64 hidden nodes.
- Positional encodings (sin and cos with different frequencies) of the inputs x, y, z, and v are used instead of the raw values. Such a mapping into a higher-dimensional space enables the MLPs to more easily approximate a high-frequency function of variation in the color and geometry of the scene.
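The positional encoding in the last bullet can be sketched as follows. The exact frequency scaling (the pi factor, the number of frequencies) varies between implementations, so treat this as the common NeRF-style variant rather than NeX's exact code:

```python
import numpy as np

def positional_encoding(x, num_freqs):
    """NeRF-style positional encoding: map each scalar input to
    [sin(2^k * pi * x), cos(2^k * pi * x)] for k = 0..num_freqs-1.
    x: (..., D) raw inputs; returns (..., D * 2 * num_freqs)."""
    freqs = 2.0 ** np.arange(num_freqs) * np.pi           # (L,)
    angles = x[..., None] * freqs                         # (..., D, L)
    enc = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)  # (..., D, 2L)
    return enc.reshape(*x.shape[:-1], -1)
```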
Controllable Neural Text Generation
Self-supervised pretraining of Language Models has become a de-facto standard nowadays. When generating sentences from a Language Model by iteratively sampling the next token, we do not have much control over attributes of the output text, such as the topic, the style, the sentiment, etc. Many applications would demand good control over the model output. For example, if we plan to use LM to generate reading materials for kids, we would like to guide the output stories to be safe, educational, and easily understood by children.
How to steer a powerful unconditioned language model? Note that model steerability is still an open research question. In this blogpost,
Lilian Weng (OpenAI) discusses several approaches for controlled content generation with an unconditioned language model:
- Apply guided decoding strategies and select desired outputs at test time.
- Optimize for the most desired outcomes via good prompt design.
- Fine-tune the base model or steerable layers to do conditioned content generation.
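As a toy illustration of the first strategy, here is a sketch of guided decoding in NumPy: rescore the LM's next-token logits with an attribute model and sample from the top-k candidates. Function and parameter names here are made up for illustration; real guided-decoding methods (weighted decoding, PPLM, etc.) are more involved.

```python
import numpy as np

def guided_next_token(lm_logits, attribute_scores, alpha=1.0, top_k=5, rng=None):
    """Pick the next token by fusing LM logits with per-token attribute scores
    (e.g. from a topic or sentiment classifier), then top-k sampling."""
    rng = rng or np.random.default_rng()
    scores = lm_logits + alpha * attribute_scores   # fused score per candidate token
    top = np.argsort(scores)[-top_k:]               # keep the top-k candidates
    probs = np.exp(scores[top] - scores[top].max()) # softmax over the survivors
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```

Raising `alpha` pushes generation harder toward the desired attribute at the cost of fluency, which is the basic trade-off of all guided-decoding schemes.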
Blogpost link
--
P.S. Lilian Weng has a very informative blog with a lot of interesting posts mostly on Reinforcement Learning and Natural Language Processing.
#NLP
FastNeRF: High-Fidelity Neural Rendering at 200FPS
Great minds think alike: FastNeRF builds on the same idea as NeX, with a slightly different implementation.
The main idea is to factorize the voxel color representation into two independent components: one that depends only on the position p = (x, y, z) of the voxel and one that depends only on the ray direction v.

Essentially, you predict K different (R, G, B) values for every voxel and K weighting scalars H_i(v) for each of them:

color(x,y,z) = RGB_1 * H_1 + RGB_2 * H_2 + ... + RGB_K * H_K

Then two neural networks, f(x,y,z) = [RGB_1, ..., RGB_K] and h(v) = [H_1, ..., H_K], are learned to predict the color components and their weights. After that, the values of these functions are cached for every voxel in the volume, which enables swift online rendering.

NeX vs FastNeRF:
- NeX achieves ~60 FPS and PSNR 27 on the Real Forward-Facing dataset using an Nvidia 2080Ti, while FastNeRF gets 50-80 FPS and PSNR 26 using an Nvidia RTX 3090 GPU; the fps rate varies from scene to scene. 200 FPS is only achieved on synthetic datasets or at lower resolution.
- FastNeRF requires a bit more memory to cache its scene representation, because NeX uses a sparse representation along the depth direction (only 192 slices) and shares RGB_i values between every 12 consecutive depth planes.
- Same core idea: factorize the color representation and predict K RGB values for every voxel instead of a single one.
- FastNeRF attempts to justify this factorization via the rendering equation (see the image with an intuitive explanation below).
- Very similar architectures: NeX has one extra learnable tensor which represents average RGB colors independent of the ray direction, with all other components learned by neural networks, while FastNeRF learns everything with neural nets.
- NeX has more extensive experiments, including experiments with fixed basis functions (i.e., computing H_i(v) using Fourier series, spherical harmonics, etc.). Interestingly, using Fourier series instead of the neural network g(v) yields only a slightly worse PSNR score and an even better LPIPS score.
- NeX also introduces and evaluates on the more challenging Shiny objects dataset; it would be interesting to see FastNeRF's results on it as well.
Overall, I would say the approaches are very similar, with some minor implementation differences. One would need to combine both implementations to get the best result.
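To make the memory comparison concrete, here is a back-of-the-envelope cache-size calculator for such a factorized representation (one alpha plus K RGB triplets per pixel-plane, with optional NeX-style coefficient sharing across consecutive planes). The storage layout and all numbers are illustrative assumptions, not figures from either paper.

```python
def mpi_cache_size_gb(h, w, d, n_coeffs, bytes_per_value=2, share_every=1):
    """Rough cache size for a factorized volume of d planes at h x w.
    Per pixel-plane: 1 alpha value; per (shared) pixel-plane group:
    n_coeffs RGB triplets. share_every > 1 models NeX-style coefficient
    sharing (alphas are never shared)."""
    alpha_values = h * w * d
    coeff_values = h * w * (d // share_every) * n_coeffs * 3
    return (alpha_values + coeff_values) * bytes_per_value / 1e9
```

Sharing coefficients across every 12 planes cuts the dominant coefficient term by roughly 12x, which is why NeX's cache can be noticeably smaller at the same plane count.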
FastNeRF paper
Unfortunately, there are only video results available and no code for FastNeRF yet.
This Marilyn Monroe never existed.
StyleGAN + Latent space regressor + CLIP (probably).
Thanks @metasemantic on Twitter
DONeRF: Towards Real-Time Rendering of Neural Radiance Fields using Depth Oracle Networks
(Graz Uni, FRL)
Another attempt to speed up NeRF: 15 FPS at 800x800.
The number of samples required for each view ray can be significantly reduced when local samples are placed around surfaces in the scene. Authors propose a depth oracle network, which predicts ray sample locations for each view ray with a single network evaluation. They show that using a classification network around logarithmically discretized and spherically warped depth values is essential to encode surface locations rather than directly estimating depth.
DONeRF is a combination of these techniques: A dual network design with a depth oracle network as a first step and a locally sampled shading network for ray accumulation.
48x speedup compared to NeRF, with equal or better quality. Obviously it is not as fast as NeX or FastNeRF, but the approach is different.
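A hedged sketch of the two ingredients described above: logarithmically discretized depth values for the oracle's classification targets, and local sample placement around a predicted surface depth. The exact warping and the `spread` parameter are assumptions for illustration, not the paper's formulas.

```python
import numpy as np

def log_depth_bins(near, far, num_bins):
    """Logarithmically discretized depth values: bin edges spaced
    uniformly in log-depth between the near and far planes."""
    return np.exp(np.linspace(np.log(near), np.log(far), num_bins + 1))

def local_samples(depth, num_samples, spread):
    """Place ray samples locally around a predicted surface depth
    instead of sampling the whole ray uniformly."""
    return depth + np.linspace(-spread, spread, num_samples)
```

With a depth oracle, the shading network only has to evaluate a handful of local samples per ray instead of dozens spread along the whole frustum, which is where the speedup comes from.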
Paper
Project page
Ted Talk with Yann LeCun
in which Yann discusses his current research into self-supervised machine learning, how he's trying to build machines that learn with common sense (like humans) and his hopes for the next conceptual breakthrough in AI.
Watch
Open source 2.7 billion parameter GPT-3 model was released
As you probably know, OpenAI has not released source code or pre-trained weights for their 175-billion-parameter language model GPT-3. Instead, they decided to create a commercial product and exclusively license GPT-3 to Microsoft.
But open-source enthusiasts from eleuther.ai have released the weights of 1.3B and 2.7B parameter models of their replication of GPT-3. It is the largest (afaik) publicly available GPT-3 replica. The primary goal of this project is to replicate a full-sized GPT-3 model and open source it to the public, for free.
The models were trained on an open-source dataset The Pile which consists of 22 smaller, high-quality datasets combined together (825 GB text in total). Larger GPT models (10B and full GPT-3) are not available yet, but soon will be.
The 2.7B model likely won't fit into the Colab TPUs memory, and you may have to get some larger TPU pods to finetune from it. Sampling from it, however, works just fine.
It is the first time I have tried a GPT model of this kind, and I was not too impressed (see my favorite results below). Maybe 2.7 billion params are not enough yet to leave the copywriters without their jobs. Let's wait for a bigger model.
Code in Mesh-TensorFlow
1.3B and 2.7B model weights and configs
Colab notebook
I decided to ask the AI if it will kill us, meatbags, in the near future. This is what I got in response...
Prompt: Will AI enslave the humankind in the next 20-30 years?
==============
GPT-3 Answers:
==============
The increasing pace of AI is not going to decrease. As of now the AI revolution has just began. The effects of AI on people and the future is not known yet, but if that is the AI's fate, then the exposure of the human population is going to increase drastically.
There is a possibility that you will be in one of the AI enslaved societies.
The AI Revolution is a powerful technologies that is coming to the world in the near future. The AI revolution will be implemented as a protection mechanism for the protection of planet earth from dangerous climate change and increasing nuclear war. The AI has the potential to be our savior for the future. It is going to be a technology that is going to change the world drastically but there is a possibility that this is simply an illusion that we are living in a world of AI and it is just a pure mass hallucination, and all those projections actually represent the AI's downfall and not an AI savior.
.I am excited to see what the AI revolution is going to bring.
.I believe that it is going to bring great changes to our civilization.
.I believe that it is going to bring great changes in human civilization.
~~~
Well, the last 10 years or so have been awful for everyone. It's not AI enslaving the humans in the next 20 or 30, it's automation on a global scale.
~~~
No. There will be mass deaths, and we'll probably know them as AI-driven holocaust or AI driven genocides. AI will have tools to justify and defend those crimes.
~~~
We already have the technology to create and control the world. I think we're already there. That's not the problem.
~~~
If there is a zombie apocalypse, perhaps we may find a way for AI to take over from humanity yet again.
~~~
Well, the last 10 years or so have been awful for everyone. It's not AI enslaving the humans in the next 20 or 30, it's automation on a global scale.
~~~
A prediction: AI will be able to look at the world around it, and decide what the world would look like without humans in it. Once the AI judge the world without humans, we'll create a world where humans are on the edge of our world and the AI is on the edge of the universe.
Why do models often attend to salient words, and how does this evolve throughout training?
This work tries to understand the black box of attention training. Early on in training, the LSTM attention models first learn how to translate individual words from the bag of words co-occurrence statistics, which then drives the learning of the attention. The authors propose a framework explaining why attention weights obtained by standard training often correlate with saliency, and how multi-head attention can increase performance by improving the training dynamics rather than expressiveness.
Read more below
Forwarded from DL in NLP (nlpcontroller_bot)
Approximating How Single Head Attention Learns
Snell et al., [Berkeley]
arxiv.org/abs/2103.07601
A look inside the training dynamics of LSTM seq2seq with attention. The main idea of the paper is KTIW: Knowledge to Translate Individual Words. To explain the dynamics, the authors divide training into two stages: uniform attention (KTIW) and non-uniform attention.
In the first stage of model training, attention does not change significantly from the uniform, and the model mainly learns to translate individual words (KTIW, a.k.a. dictionary translation). After KTIW is learned, attention starts forming its patterns, and this process is driven by the KTIW. As correct word translations are already more probable, now attention mainly needs to align the words from the source and target language.
To quantitatively test the hypothesis, they develop a new lexical probe that is essentially hard attention. Yet, the most impressive result is that attention cannot learn a simple copy operation if KTIW is not learned.
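For reference, the single-head dot-product attention being analyzed can be sketched in a few lines of NumPy. With untrained (near-zero) parameters the scores are roughly equal, which gives exactly the uniform-attention first stage the paper describes:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def single_head_attention(query, keys, values):
    """Single-head dot-product attention of one decoder query over encoder states.
    query: (d,); keys: (T, d); values: (T, dv).
    Returns the context vector and the alignment (attention) distribution."""
    scores = keys @ query        # (T,) unnormalized alignment scores
    weights = softmax(scores)    # attention distribution over source words
    return weights @ values, weights
```

Once the model has learned to translate individual words (KTIW), gradients start sharpening `weights` toward the aligned source word; before that, the distribution stays close to uniform.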