Future of human-computer interaction — the 10-year vision by Facebook Reality Labs
Say you decide to walk to your local cafe to get some work done. You’re wearing a pair of AR glasses and a soft wristband. As you head out the door, your Assistant asks if you’d like to listen to the latest episode of your favorite podcast. A small movement of your finger lets you click “play.”
As you enter the cafe, your Assistant asks, “Do you want me to put in an order for a 12-ounce Americano?” Not in the mood for your usual, you again flick your finger to click “no.”
You head to a table, but instead of pulling out a laptop, you pull out a pair of soft, lightweight haptic gloves. When you put them on, a virtual screen and keyboard show up in front of you and you begin to edit a document. Typing is just as intuitive as typing on a physical keyboard and you’re on a roll, but the noise from the cafe makes it hard to concentrate.
Read more about this vision of the future of HCI in the Facebook Reality Labs (FRL) blog post.
The ultra-low-friction AR interface will be built on two technological pillars:
1. Ultra-low-friction input, so that when you need to act, the path from thought to action is as short and intuitive as possible. You might gesture with your hand, issue voice commands, or select items from a menu by looking at them — actions enabled by hand-tracking cameras, a microphone array, and eye-tracking technology.
But ultimately, you’ll need a more natural way: neural input, e.g., wrist-based electromyography (EMG).
Wrist-based EMG reads the signals on the motor neurons that run from the spinal cord to the hand. The signals through the wrist are so clear that EMG can detect finger motion of just a millimeter. Ultimately it may even be possible to sense just the intent to move a finger.
2. The second pillar is the use of AI, context, and personalization to scope the effects of your input actions to your needs at any given moment. AI should adapt the input interface to the context/environment and, ideally, anticipate the user's needs.
I strongly recommend watching the Keynote talk by FRL Chief Scientist Michael Abrash. The FRL projects are very ambitious.
Continuing the discussion about novel Human-Computer Interfaces 🦾
Technologies & Startups that Hack The Brain: Beyond the Healthcare Market
A review of 30 startups, their markets, business models, tech, and where machine learning fits in.
This article takes a rather broad view of neurotech, covering brain-computer interfaces (BCIs, both invasive and noninvasive) and various technologies, e.g., electroencephalography (EEG), electromyography (EMG), functional near-infrared spectroscopy (fNIRS), and others. It also covers neuromodulation, which partially overlaps with the BCI space.
Gucci and Belarusian startup Wanna created virtual sneakers.
You can buy them in the Gucci app for $12 or in the Wanna Kicks app for $9 🤭
I'm not a big fan of such applications. I appreciate the efforts of the Wanna team: they have come a long way since last year, and the shoes fit the foot much better now. Still, such sneakers look a bit toyish in my opinion. To make the material look more realistic, one would need to adapt the rendering to the current lighting conditions and shadows.
Would you use this app?
Video from @futuresailors.
Whatsup people 🤙🏼,
Today is the ICCV submission deadline, and it is very tricky to write a good Introduction for your paper.
Luckily, Prof. Kate Saenko (the Russian-speaking part of the channel probably knows her) shares her experience and the template she gives to new graduate students 🙂.
#phd_tips
🌐 How to Write the Introduction in 3 Easy Steps.
Not bad, HTC! It looks like everyone is trying to create their own VR headset. The face tracking and hand movements look impressive. However, the manipulation part is still not comfortable: I don't want to hold those sticks all the time 😐
https://tttttt.me/ai_newz/344
A nice infographic about the amount of data uploaded and consumed every day. It was created in 2019, though; by now the numbers have at least doubled, IMO.
Full resolution
How to easily edit and compose images like in Photoshop using GANs?
MIT
🎯Task:
Given an incomplete image or a collage of images, generate a realistic image from it.
🔑Method:
This paper presents a simple approach: given a fixed pretrained generator (e.g., StyleGAN), they train a regressor network to predict the latent code from an input image. To teach the regressor to predict the latent code for images with missing pixels, they mask random patches during training.
Now, given an input collage, the regressor projects it into a reasonable location of the latent space, which the generator then maps onto the image manifold. Such an approach enables more localized editing of individual image parts compared to direct editing in the latent space.
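To make the training recipe concrete, here is a minimal sketch of the masked latent-regression training step, assuming a frozen pretrained generator (e.g., StyleGAN) and a regressor CNN that takes the masked image plus the mask as input. The names, masking scheme, and plain L1 loss are illustrative; the authors' implementation likely differs in details (e.g., an added perceptual term).
```python
# Minimal sketch of the masked latent-regression training step.
# Assumptions: `generator` is a frozen pretrained GAN generator (latent -> image),
# `regressor` is a CNN taking a 4-channel input (masked RGB + mask) and outputting a latent code.
import torch
import torch.nn.functional as F

def train_step(regressor, generator, images, optimizer):
    b, _, h, w = images.shape
    # Mask a random rectangular patch per image so the regressor learns to handle missing pixels.
    mask = torch.ones(b, 1, h, w, device=images.device)
    for i in range(b):
        ph, pw = h // 4, w // 4
        y = torch.randint(0, h - ph, (1,)).item()
        x = torch.randint(0, w - pw, (1,)).item()
        mask[i, :, y:y + ph, x:x + pw] = 0.0
    masked = images * mask

    # Predict a latent code from the masked image; the generator stays frozen,
    # so gradients only update the regressor.
    latent = regressor(torch.cat([masked, mask], dim=1))
    recon = generator(latent)

    # Penalize reconstruction error only on the observed pixels.
    loss = F.l1_loss(recon * mask, images * mask)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```
At inference, you feed the collage (with a mask of the valid pixels) through the regressor and the generator once to get the blended, realistic image.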
📚Interesting findings:
- Even though the regressor is never trained on unrealistic and incoherent collages, it projects a given collage into a reasonable latent code.
- The authors show that the representation of the generator is already compositional in the latent code: altering a part of the input image changes the regressed latent code in the corresponding location.
➕Pros:
- As input, we need only a single example of approximately how we want the generated image to look (it can be a collage of different images).
- Requires only one forward pass of the regressor and the generator -> fast, unlike iterative optimization approaches that can take up to a minute to reconstruct an image (https://arxiv.org/abs/1911.11544).
- Does not require any labeled attributes.
💬Applications
- Image inpainting.
- Example-based image editing (incoherent collage -> to realistic image).
#paper_explained #cv
📝 Paper: Using latent space regression to analyze and leverage compositionality in GANs
🌐 Project page
⚒ Code
📓 Colab
Learning to resize: replace the fixed front-end resizer in deep networks with a learnable non-linear resizer
Google Research
Deep computer vision models can benefit greatly from replacing the fixed linear resizer used to downsample ImageNet images before training with a well-designed, learned, non-linear resizer.
The structure of the learned resizer is specific; it is not just a matter of adding more generic convolutional layers to the baseline model. It seems to strive to encode some extra information in the downsampled image, and that is where the extra performance on ImageNet comes from.
This work shows that a generically deeper model can be improved upon with a well-designed, task-optimized front-end processor.
Looking ahead: there is probably a lot of room for work on task-optimized pre-processing modules for computer vision and other tasks.
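To make this more concrete, here is a rough sketch of such a learnable resizer: a bilinear skip connection plus a small residual CNN, trained jointly with the downstream model. The layer counts, channel widths, and kernel sizes are my assumptions, not the exact architecture from the paper.
```python
# Sketch of a learnable, non-linear resizer (bilinear skip + learned residual).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, ch=16):
        super().__init__()
        self.conv1 = nn.Conv2d(ch, ch, 3, padding=1)
        self.conv2 = nn.Conv2d(ch, ch, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(ch)
        self.bn2 = nn.BatchNorm2d(ch)

    def forward(self, x):
        y = F.leaky_relu(self.bn1(self.conv1(x)), 0.2)
        return x + self.bn2(self.conv2(y))

class LearnableResizer(nn.Module):
    """Replaces the fixed bilinear downsampler: bilinear skip + learned residual."""
    def __init__(self, out_size=(224, 224), ch=16, n_blocks=2):
        super().__init__()
        self.out_size = out_size
        self.head = nn.Sequential(
            nn.Conv2d(3, ch, 7, padding=3), nn.LeakyReLU(0.2),
            nn.Conv2d(ch, ch, 1), nn.LeakyReLU(0.2), nn.BatchNorm2d(ch),
        )
        self.body = nn.Sequential(*[ResBlock(ch) for _ in range(n_blocks)])
        self.tail = nn.Conv2d(ch, 3, 7, padding=3)

    def forward(self, x):
        skip = F.interpolate(x, size=self.out_size, mode="bilinear", align_corners=False)
        feat = F.interpolate(self.head(x), size=self.out_size, mode="bilinear", align_corners=False)
        return self.tail(self.body(feat)) + skip

# The resizer is trained jointly with the recognition model, e.g.:
# logits = classifier(LearnableResizer()(high_res_images))
```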
📝 Paper
No code yet
#cv #paper_explained
🔥New video on my YouTube channel!🔥
I have created a detailed video explanation of the paper "NeX: Real-time View Synthesis with Neural Basis Expansion"
🎯 Task
Given a set of photos (10-60 photos) of the scene, learn some 3D representation of the scene which would allow rendering the scene from novel camera poses.
❓ How?
The proposed approach uses a modification of the Multiplane Image (MPI), where it models view-dependent effects by parameterizing each pixel as a linear combination of basis functions learned by a neural network. The pixel representation (i.e., the coordinates in the set of bases defined by the basis functions) depends on the pixel coordinates (x, y, z) but not on the viewing angle. In contrast, the basis functions depend only on the viewing angle and are the same for every pixel if the angle is fixed. Such decoupling of angle and coordinates allows caching all pixel representations, which results in a 100x speedup of novel-view rendering (60 FPS!). Moreover, the proposed scene parametrization allows rendering specular (non-Lambertian) objects with complex view-dependent effects.
✏️ Detailed approach summary
A Multiplane Image is a 3D scene representation that consists of a collection of D planar images, each with dimension H × W × 4, where the last dimension contains RGB values and alpha transparency values. These planes are scaled and placed equidistantly either in depth space (for bounded close-up objects) or in inverse depth space (for scenes that extend out to infinity) along a reference viewing frustum.
One main limitation of MPI is that it can only model diffuse or Lambertian surfaces, whose colors appear constant regardless of the viewing angle. In real-world scenes, many objects are non-Lambertian, such as a ceramic plate, a glass table, or a metal wrench.
Regressing the color directly from the viewing angle v (and the pixel location [x, y, z]) with a neural network F(x, y, z, v), as is done in NeRF, is very inefficient for real-time rendering because it requires recomputing every voxel in the volume for every new camera pose. The key idea of the NeX method is to approximate this function F(x, y, z, v) with a linear combination of learnable basis functions {H_n(v): R^2 → R^{3x3}}.
To summarize, the modified MPI contains the following parameters per pixel: α, k0, k1, ..., k_N. These parameters are predicted by a neural network f(x, y, z) for every pixel.
Another set of parameters, the global basis matrices H1(v), H2(v), ..., H_N(v), is shared across all pixels but depends on the viewing angle v. The columns of H_n(v) are basis vectors of some color space different from RGB space. These basis matrices are predicted by another neural network g(v) = [H1(v), H2(v), ..., H_N(v)].
The motivation for using the second network is to ensure that the prediction of the basis functions is independent of the voxel coordinates. This allows precomputing and caching the output of f(x, y, z) for all coordinates. Therefore, a novel view can be synthesized with just a single forward pass of the network g(v), because f() does not depend on v and does not need to be recomputed.
Compared with NeRF, the proposed MPI can be thought of as a discretized sampling of an implicit radiance field function, decoupled into view-dependent basis functions H_n(v) and view-independent parameters α and k_n, n = 1...N.
▶️ Video explanation
🌐 NEX project page
📝 NEX paper
⏱ Realtime demo
💠 Multiplane Images (MPI)
💠 NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis
#paper_explained #cv #video_exp
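To make the factorized color model from the summary above concrete, here is a toy sketch of how a pixel's view-dependent color could be assembled from the cached, view-independent coefficients and the global basis produced by g(v). The shapes follow my reading of the summary (each H_n(v) as a 3x3 color-space matrix acting on a per-pixel coefficient vector k_n); the paper's exact parameterization may differ, so treat this as an illustration of the caching idea rather than the official implementation.
```python
# Toy sketch of NeX-style rendering from cached per-pixel coefficients.
# Assumptions: k0 and k are cached outputs of f(x, y, z) (view independent),
# H contains the global basis matrices produced by g(v) once per view.
import torch

def compose_color(k0, k, H):
    """
    k0: (P, 3)      cached base colors for P pixels
    k:  (P, N, 3)   cached coefficients per pixel and basis
    H:  (N, 3, 3)   global basis matrices for the current viewing direction v
    returns: (P, 3) view-dependent RGB per pixel
    """
    # For each basis n, map the per-pixel coefficient k[:, n] through H[n],
    # then sum over the bases and add the base color.
    view_dependent = torch.einsum("nij,pnj->pi", H, k)
    return k0 + view_dependent

# Per novel view, only the tiny network g(v) needs to run to produce H;
# k0 and k come from the precomputed cache, which is what gives the 100x speedup.
```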
🧑🎓 Some NeX implementation details:
- Fine details or high-frequency content tend to come from the surface texture itself and not necessarily from complex scene geometry. The authors found that simply storing the first coefficient k0, or "base color," explicitly helps ease the network's burden of compressing and reproducing detail and leads to sharper results in fewer iterations.
- k0 is optimized explicitly as a learnable parameter with a total variation regularizer.
- Computing and storing all N + 1 coefficients (k0, k1, ..., kN) for all pixels and all D depth planes can be expensive for both training and rendering, so the authors use a coefficient-sharing scheme where every M consecutive planes share the same coefficients, but not the alphas.
- Total reconstruction loss = photometric loss + difference of the image gradients.
- For a scene with 17 input photos at a resolution of 1008x756, training takes ~18 hours on 1x NVIDIA V100 with a batch size of 1.
- To render one pixel, NeX uses 0.16 MFLOPs, whereas NeRF uses 226 MFLOPs.
- f() is an MLP with 6 FC layers, each with 384 hidden nodes.
- g() is an MLP with 3 FC layers, each with 64 hidden nodes.
- Positional encodings (sin and cos at different frequencies) of the inputs x, y, z, and v are used instead of the raw values. Such a mapping into a higher-dimensional space enables the MLPs to more easily approximate high-frequency variation in the color and geometry of the scene.
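For reference, the sin/cos positional encoding mentioned in the last bullet can be sketched as follows. This is the standard NeRF-style encoding; the number of frequencies (10 here) is a hyperparameter I picked for illustration, not necessarily NeX's setting.
```python
# NeRF-style positional encoding: map each scalar input to [sin(2^j * pi * x), cos(2^j * pi * x)].
import math
import torch

def positional_encoding(x, num_freqs=10, include_input=True):
    """
    x: (..., D) raw coordinates (e.g., x, y, z or the view direction v)
    returns: (..., D * 2 * num_freqs [+ D]) encoded features
    """
    freqs = (2.0 ** torch.arange(num_freqs, dtype=x.dtype, device=x.device)) * math.pi
    angles = x[..., None] * freqs                           # (..., D, num_freqs)
    enc = torch.cat([angles.sin(), angles.cos()], dim=-1)   # (..., D, 2 * num_freqs)
    enc = enc.flatten(-2)                                   # (..., D * 2 * num_freqs)
    if include_input:
        enc = torch.cat([x, enc], dim=-1)
    return enc

# Example: encode a batch of 3D points before feeding them to the MLP f(x, y, z).
points = torch.rand(1024, 3)
features = positional_encoding(points, num_freqs=10)   # shape (1024, 63)
```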
Controllable Neural Text Generation
Self-supervised pretraining of language models has become a de-facto standard nowadays. When generating sentences from a language model by iteratively sampling the next token, we do not have much control over attributes of the output text, such as the topic, the style, the sentiment, etc. Many applications demand good control over the model output. For example, if we plan to use an LM to generate reading materials for kids, we would like to guide the output stories to be safe, educational, and easily understood by children.
How do you steer a powerful unconditioned language model? Note that model steerability is still an open research question. In this blog post, Lilian Weng (OpenAI) discusses several approaches to controlled content generation with an unconditioned language model (a toy sketch of the first one follows the list):
- Apply guided decoding strategies and select desired outputs at test time.
- Optimize for the most desired outcomes via good prompt design.
- Fine-tune the base model or steerable layers to do conditioned content generation.
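To give a flavor of the first option (guided decoding / selecting desired outputs at test time), here is a toy sample-and-rerank example: sample several continuations from an unconditioned GPT-2 and keep the one an attribute classifier (sentiment, here) scores highest. Model choices and the scoring rule are illustrative only; the blog post covers much more sophisticated schemes.
```python
# Toy "guided decoding" via sample-and-rerank: generate candidates with an
# unconditioned LM, then keep the continuation an attribute classifier likes best.
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
classifier = pipeline("sentiment-analysis")  # stands in for any attribute classifier

prompt = "The book I read yesterday was"
candidates = generator(
    prompt, max_length=40, do_sample=True, top_p=0.9, num_return_sequences=8
)

def positivity(text):
    # Score each candidate by how confidently the classifier calls it POSITIVE.
    result = classifier(text)[0]
    return result["score"] if result["label"] == "POSITIVE" else -result["score"]

best = max((c["generated_text"] for c in candidates), key=positivity)
print(best)
```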
🌀 Blogpost link
--
P.S. Lilian Weng has a very informative blog with a lot of interesting posts mostly on Reinforcement Learning and Natural Language Processing.
#NLP
FastNeRF: High-Fidelity Neural Rendering at 200FPS
Smart ideas never come to just one head. FastNeRF has the same idea as NeX, but a slightly different implementation.
The main idea is to factorize the voxel color representation into two independent components: one that depends only on the position p = (x, y, z) of the voxel and one that depends only on the ray direction v.
Essentially, you predict K different (R, G, B) values for every voxel and K weighting scalars H_i(v) for each of them: color(x, y, z) = RGB_1 * H_1 + RGB_2 * H_2 + ... + RGB_K * H_K.
Then two neural networks, f(x, y, z) = [RGB_1, ..., RGB_K] and h(v) = [H_1, ..., H_K], are learned to predict the color components and their weights. After that, the values of these functions are cached for every voxel in the volume, which enables swift online rendering.
⚔️ NeX(➖) vs FastNeRF(➕):
➖ NeX achieves ~60 FPS and PSNR 27 on the Real Forward-Facing dataset using an NVIDIA 2080 Ti, while FastNeRF reaches 50-80 FPS and PSNR 26 using an NVIDIA RTX 3090 GPU; the frame rate varies from scene to scene. 200 FPS is only achieved on synthetic datasets or at lower resolution.
➖ FastNeRF requires a bit more memory to cache its scene representation, because NeX uses a sparse representation along the depth direction (only 192 slices) and shares the RGB_i values between every 12 consecutive depth planes.
➖ The same idea: factorize the color representation and predict K RGB values for every voxel instead of a single one.
➕ FastNeRF attempts to justify such a factorization via the rendering equation (see the image with an intuitive explanation below).
➕ Very similar architectures. NeX has one extra learnable tensor which represents the average RGB colors independent of the ray direction; all other components are learned by neural networks. FastNeRF learns everything with neural nets.
➖ NeX has more extensive experiments, including experiments with fixed basis functions (i.e., computing H_i(v) using Fourier series, spherical harmonics, etc.). Interestingly, using Fourier series instead of the neural network g(v) yields only a slightly worse PSNR score and an even better LPIPS score.
➖ NeX also introduces and is evaluated on the more challenging Shiny objects dataset. It would be interesting to see the results of FastNeRF on that dataset as well.
❗️👉 Overall, I would say the approaches are very similar, with some minor implementation differences. One would need to combine both implementations to get the best result.
🌐 FastNeRF paper
Unfortunately, only video results are available; there is no code for FastNeRF yet.
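Since no code is available, here is a rough sketch of what FastNeRF-style rendering from caches could look like: a dense spatial grid caches f(x, y, z) = [RGB_1, ..., RGB_K], the per-view weights H_1..H_K come from h(v), and the color is their weighted sum. The grid resolution, the nearest-neighbour lookup, and all names are my assumptions; the real caches are much denser (hence the memory discussion above).
```python
# Rough sketch of FastNeRF-style rendering from caches (sizes and names are assumptions).
import torch

K = 8   # number of factorized color components
D = 64  # spatial cache resolution per axis (kept tiny here; real caches are far denser)

# Cache of f(x, y, z): for every voxel, K RGB triplets -> shape (D, D, D, K, 3).
rgb_cache = torch.rand(D, D, D, K, 3)

def color_at(voxel_idx, view_weights):
    """
    voxel_idx:    (M, 3) integer indices into the spatial cache (nearest-neighbour lookup
                  for simplicity; a real renderer would interpolate).
    view_weights: (K,) the weights H_1..H_K = h(v) for the current ray direction.
    returns:      (M, 3) RGB colors.
    """
    rgb_k = rgb_cache[voxel_idx[:, 0], voxel_idx[:, 1], voxel_idx[:, 2]]  # (M, K, 3)
    # color = sum_k H_k * RGB_k
    return torch.einsum("k,mkc->mc", view_weights, rgb_k)

# Example: query 1024 voxels along some rays for one viewing direction.
idx = torch.randint(0, D, (1024, 3))
weights = torch.softmax(torch.rand(K), dim=0)   # stands in for h(v)
colors = color_at(idx, weights)                 # (1024, 3)
```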