Gradient Dude
TL;DR for DL/CV/ML/AI papers from an author with publications at top-tier AI conferences (CVPR, NIPS, ICCV, ECCV).

Most ML feeds go for fluff; we go for the real meat.

YouTube: youtube.com/c/gradientdude
IG: instagram.com/gradientdude
🔥New video on my YouTube channel!🔥
I have created a detailed video explanation of the paper "NeX: Real-time View Synthesis with Neural Basis Expansion".

🎯 Task
Given a set of photos of a scene (10-60 images), learn a 3D representation of the scene that allows rendering it from novel camera poses.

How?
The proposed approach uses a modification of the Multiplane Image (MPI) representation that models view-dependent effects by parameterizing each pixel's color as a linear combination of basis functions learned by a neural network. The pixel representation (i.e., its coefficients with respect to the learned basis functions) depends on the pixel coordinates (x, y, z) but not on the viewing angle. In contrast, the basis functions depend only on the viewing angle and are shared by all pixels once the angle is fixed. This decoupling of angle and coordinates allows caching all pixel representations, which results in a 100x speedup of novel-view rendering (60 FPS!). Moreover, the proposed scene parametrization allows rendering specular (non-Lambertian) objects with complex view-dependent effects.

✏️ Detailed approach summary
A multiplane image is a 3D scene representation that consists of a collection of D planar images, each of dimension H × W × 4, where the last dimension holds RGB values and alpha transparency. These planes are scaled and placed equidistantly either in depth space (for bounded close-up objects) or in inverse depth space (for scenes that extend out to infinity) along a reference viewing frustum.
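To make the geometry concrete, here is a minimal NumPy sketch of the two ingredients just described: equidistant plane placement (in depth or inverse depth) and the standard back-to-front alpha compositing of the planes. Function names and array shapes are my own, not the paper's.

```python
import numpy as np

def mpi_plane_depths(d_near, d_far, num_planes, inverse=False):
    """Place D planes equidistantly in depth (bounded scenes)
    or in inverse depth (scenes that extend out to infinity)."""
    if inverse:
        return 1.0 / np.linspace(1.0 / d_near, 1.0 / d_far, num_planes)
    return np.linspace(d_near, d_far, num_planes)

def composite_mpi(rgb, alpha):
    """'Over' alpha compositing of the planes, ordered back to front.
    rgb: (D, H, W, 3), alpha: (D, H, W, 1) -> (H, W, 3) image."""
    out = np.zeros(rgb.shape[1:])
    for d in range(rgb.shape[0]):
        out = rgb[d] * alpha[d] + out * (1.0 - alpha[d])
    return out
```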

A key limitation of MPI is that it can only model diffuse (Lambertian) surfaces, whose colors appear constant regardless of the viewing angle. Real-world scenes contain many non-Lambertian objects, such as a ceramic plate, a glass table, or a metal wrench.

Regressing the color directly from the viewing angle v (and the pixel location [x, y, z]) with a neural network F(x, y, z, v), as is done in NeRF, is too inefficient for real-time rendering, as it requires recomputing every voxel in the volume for every new camera pose.

The key idea of the NeX method is to approximate this function F(x, y, z, v) with a linear combination of learnable basis functions {H_n(v): R^2 → R^{3x3}}: the color of a pixel is computed as C(v) = k_0 + Σ_{n=1..N} H_n(v) k_n, where k_0 is a view-independent base color and the k_n are per-pixel coefficients.

To summarize, the modified MPI contains the following parameters per pixel: α, k_0, k_1, ..., k_N. These parameters are predicted by a neural network f(x, y, z) for every pixel.
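A minimal PyTorch sketch of what such a network f could look like; the layer sizes, the value of N, and the omission of positional encoding on the input coordinates are my assumptions for illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

N_BASES = 8  # N, the number of basis functions (hyperparameter; value assumed)

class PixelMLP(nn.Module):
    """Sketch of f(x, y, z): predicts the per-pixel MPI parameters
    alpha and k0, k1, ..., kN from the coordinate alone (no view direction)."""
    def __init__(self, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1 + 3 * (N_BASES + 1)),  # alpha + k0..kN
        )

    def forward(self, xyz):                       # xyz: (B, 3)
        out = self.net(xyz)
        alpha = torch.sigmoid(out[:, :1])         # transparency in [0, 1]
        k = out[:, 1:].view(-1, N_BASES + 1, 3)   # (B, N+1, 3): k0..kN
        return alpha, k
```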

Another set of parameters, the global basis matrices H_1(v), H_2(v), ..., H_N(v), is shared across all pixels but depends on the viewing angle v. The columns of H_n(v) are the basis vectors of a learned color space different from RGB. These basis matrices are predicted by a second neural network: g(v) = [H_1(v), H_2(v), ..., H_N(v)].
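Continuing the sketch (reusing N_BASES from above), a hypothetical g(v) producing the N basis matrices, together with the linear-combination formula applied to a single pixel:

```python
import torch
import torch.nn as nn

class BasisNet(nn.Module):
    """Sketch of g(v): predicts the global basis matrices H_1(v), ..., H_N(v),
    shared by every pixel for a given viewing angle."""
    def __init__(self, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2, hidden), nn.ReLU(),
            nn.Linear(hidden, N_BASES * 9),       # N matrices of 3x3 entries
        )

    def forward(self, v):                         # v: (2,) viewing angle
        return self.net(v).view(N_BASES, 3, 3)

def pixel_color(k, H):
    """C(v) = k0 + sum_n H_n(v) @ k_n for one pixel.
    k: (N+1, 3) per-pixel coefficients, H: (N, 3, 3) view-dependent bases."""
    return k[0] + torch.einsum('nij,nj->i', H, k[1:])
```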

The motivation for using a second network is to keep the prediction of the basis functions independent of the voxel coordinates. This makes it possible to precompute and cache the output of f(x, y, z) for all coordinates. A novel view can then be synthesized with just a single forward pass of g(v), because f() does not depend on v and never has to be recomputed.
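Put together, the caching trick looks roughly like this (reusing the hypothetical PixelMLP and BasisNet sketches above; the voxel grid here is a placeholder):

```python
import torch

pixel_mlp, basis_net = PixelMLP(), BasisNet()   # sketches from above
all_voxel_coords = torch.rand(100_000, 3)       # placeholder (x, y, z) grid

# One-time precomputation: f never sees the viewing angle v,
# so its outputs can be evaluated once and cached.
with torch.no_grad():
    alpha_cache, k_cache = pixel_mlp(all_voxel_coords)

def render_novel_view(v):
    """Per novel view: a single forward pass of g(v); the cached per-pixel
    parameters are only recombined, never recomputed."""
    with torch.no_grad():
        H = basis_net(v)                          # (N, 3, 3)
        rgb = k_cache[:, 0] + torch.einsum('nij,bnj->bi', H, k_cache[:, 1:])
    return rgb, alpha_cache  # then alpha-composite the MPI planes as usual
```

Only g(v) runs per frame; everything else reduces to cached tensor arithmetic, which is what makes real-time rendering possible.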

Compared with NeRF, the proposed MPI can be thought of as a discretized sampling of an implicit radiance field, decoupled into view-dependent basis functions H_n(v) and view-independent parameters α and k_n, n = 1...N.

▶️ Video explanation
🌐 NeX project page
📝 NeX paper
Real-time demo

💠 Multiplane Images (MPI)
💠 NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
#paper_explained #cv #video_exp