For low-power robotics, where energy is constrained by platform size or mission duration, we would like to use a DNN as a “monocular depth sensor” for navigating the environment instead of carrying a bulky or power-hungry LiDAR, structured-light, or stereo camera. However, DNNs sometimes fail and produce inaccurate depth predictions. We would like to know when a DNN depth prediction can be trusted, and we would like to determine this energy-efficiently.
DNNs can fail due to two sources of uncertainty: aleatoric uncertainty, the uncertainty inherent to the data, and epistemic uncertainty, the uncertainty inherent to the model. We show a toy example on 1D regression: aleatoric uncertainty can occur both in-distribution and out-of-distribution of the training points and is correlated with the noise inherent to the data. Epistemic uncertainty, meanwhile, occurs on out-of-distribution inputs where we have no training points, representing uncertainty about novel examples not seen during training. If we predict only aleatoric uncertainty, as in this example, we can miss part of our total predictive uncertainty, especially on out-of-distribution inputs!
Caption: An example of 1D regression with an NN with three 100-unit hidden layers and predicted aleatoric uncertainty in-distribution (between dashed lines [-4, 4]) and out-of-distribution (left and right of the dashed lines). Aleatoric uncertainty alone does not capture the epistemic uncertainty out-of-distribution. (Credit: Ari Grayzel)
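A common way to train a network to predict aleatoric variance alongside the mean, as in the regression setup above, is to minimize the heteroscedastic Gaussian negative log-likelihood. A minimal sketch of that loss (not the paper's code; the constant term is dropped since it does not affect optimization):

```python
import numpy as np

def gaussian_nll(y, mu, log_var):
    """Heteroscedastic Gaussian negative log-likelihood for one sample.

    The network predicts both the mean `mu` and the log-variance
    `log_var`; exp(log_var) is the predicted aleatoric variance.
    The constant 0.5 * log(2 * pi) is omitted.
    """
    return 0.5 * (log_var + (y - mu) ** 2 / np.exp(log_var))

# A perfect prediction with unit variance (log_var = 0) gives zero loss;
# a unit error with the same predicted variance is penalized.
print(gaussian_nll(0.0, 0.0, 0.0))
print(gaussian_nll(1.0, 0.0, 0.0))
```

Predicting `log_var` rather than the variance directly keeps the predicted variance positive without an explicit constraint.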
While aleatoric uncertainty can be estimated efficiently using a single pass of the DNN, most methods to estimate epistemic uncertainty, such as ensembles and sampling-based methods (e.g., MC-Dropout), require M inferences per input, making total uncertainty estimation computationally expensive. For example, an ensemble of size 10 or 10 MC-Dropout samples will increase latency and energy at test time by approximately a factor of 10!
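To make the M-inference cost concrete: a standard way to combine M ensemble (or MC-Dropout) predictions into a total predictive variance is the law of total variance, where epistemic uncertainty shows up as disagreement between the M predicted means. A sketch, assuming each member outputs a mean and an aleatoric variance per pixel:

```python
import numpy as np

def ensemble_total_uncertainty(mus, variances):
    """Combine M per-member predictions into a total predictive variance.

    mus, variances: arrays of shape (M, ...) holding each member's
    predicted mean depth and predicted aleatoric variance.
    """
    mu = mus.mean(axis=0)               # ensemble mean prediction
    aleatoric = variances.mean(axis=0)  # average predicted data noise
    epistemic = mus.var(axis=0)         # disagreement between members
    return mu, aleatoric + epistemic    # total predictive variance

# Two members that disagree (depths 1 m and 3 m) yield large epistemic
# variance on top of the averaged aleatoric variance.
mu, total_var = ensemble_total_uncertainty(
    np.array([[1.0], [3.0]]), np.array([[0.5], [0.5]])
)
```

Every element of this computation needs all M forward passes to have run on the current input, which is exactly the per-frame cost UfM avoids.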
We introduce our algorithm, Uncertainty from Motion (UfM), that allows us to run only one inference per input while still obtaining close to ensemble uncertainty quality. The content of this blog post is based on our paper “Uncertainty from Motion for DNN Monocular Depth Estimation” (ICRA 2022), so feel free to check it out for more technical details!
Uncertainty from Motion (UfM)
UfM works on top of any DNN ensemble or sampling-based method that outputs both a depth prediction and an aleatoric uncertainty prediction, and requires no retraining of the DNN. The intuition behind the algorithm is that robots navigate the environment based on video inputs, and video contains a lot of temporal redundancy: the same point in 3D space may be seen in multiple images of the sequence from different camera poses. Assuming we can obtain a pose estimate of the robot (a reasonable assumption, since robots need to localize to navigate), we can fuse the depth and aleatoric uncertainty predictions from different ensemble members or sampled DNNs across a sequence of frames – essentially, ensembling over time instead of ensembling per image. We maintain a 3D Gaussian for each 3D point seen and, whenever another view of the same point arrives, fold its new depth and aleatoric uncertainty predictions into that Gaussian as one component of an incrementally maintained uniform mixture of Gaussians. Because the complexity of UfM depends on the size of the point cloud of 3D Gaussians, we cap the point cloud size and evict points that are not in the current view or that have not been seen recently.
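The per-point fusion above can be sketched as incremental moment tracking for a uniform mixture of Gaussians: the mixture's mean and variance are recoverable from a count and two running sums, so each new view costs O(1) per point. This is an illustrative sketch, not the paper's code, and the names are our own:

```python
class PointFuser:
    """Fuses per-view (depth, aleatoric variance) predictions for one
    3D point as a uniform mixture of Gaussians, one component per view.

    Only the mixture's first two moments are tracked, so each update
    is O(1) regardless of how many views have been fused.
    """

    def __init__(self):
        self.n = 0             # number of views fused so far
        self.sum_mu = 0.0      # running sum of predicted means
        self.sum_second = 0.0  # running sum of (variance + mean^2)

    def update(self, mu, var):
        """Fold in one view's predicted depth `mu` and aleatoric `var`."""
        self.n += 1
        self.sum_mu += mu
        self.sum_second += var + mu * mu

    def mean(self):
        return self.sum_mu / self.n

    def variance(self):
        # E[X^2] - E[X]^2 for the uniform mixture of the fused Gaussians
        m = self.mean()
        return self.sum_second / self.n - m * m

# Two views of the same point that disagree (1 m vs. 3 m) inflate the
# fused variance beyond the per-view aleatoric variance alone.
fuser = PointFuser()
fuser.update(1.0, 0.5)
fuser.update(3.0, 0.5)
```

Note that fusing two disagreeing views this way yields the same total variance (average aleatoric variance plus variance of the means) that an ensemble would report if it produced those two predictions on a single frame, which is why UfM can approach ensemble uncertainty quality with one inference per frame.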
We show a video example of UfM vs. an ensemble on the same sequence below to highlight the similarity in uncertainty prediction at a fraction of the cost. Note that while UfM can replicate ensemble uncertainty quality, if the ensemble's uncertainty quality is poor, UfM will replicate that poor quality as well.
To summarize, UfM can be applied to any ensemble or sampling-based DNN that outputs a depth prediction and aleatoric variance and reduces the number of inferences needed for total uncertainty estimation to just one per input, instead of M inferences per input. The computational overhead is lightweight enough to make the savings significant, and there are some interesting future directions on how we can reduce the memory overhead as well.