In many robotics and VR/AR applications, fast camera motions produce severe motion blur that causes existing camera pose estimation methods to fail. In this work, we propose a novel framework that leverages motion blur as a rich cue for motion estimation rather than treating it as an unwanted artifact. Our approach predicts a dense motion flow field and a monocular depth map directly from a single motion-blurred image. We then recover the instantaneous camera velocity by solving a linear least squares problem under the small motion assumption. In essence, our method produces an IMU-like measurement that robustly captures fast and aggressive camera movements. To train our model, we construct a large-scale dataset with realistic synthetic motion blur derived from ScanNet++v2, and we further refine the model by training end-to-end on real data using our fully differentiable pipeline. Extensive evaluations on real-world benchmarks demonstrate that our method achieves state-of-the-art angular and translational velocity estimates, outperforming existing methods such as MASt3R and COLMAP.
A common assumption in camera pose estimation is that each image represents a single instantaneous snapshot in time. In reality, however, the camera moves continuously over the exposure time, which produces motion blur under fast motion. While this blur is problematic for standard pose estimation methods, we exploit it as a rich source of information for estimating the motion.
Given a single image, our method predicts a dense motion flow field and a monocular depth map. We then recover the instantaneous camera velocity with linear least squares using the known exposure time and intrinsics. To disambiguate the velocity direction in a video, we use the predicted motion to compute the photometric error between the current frame and the previous/next frames and flip the direction if necessary.
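To make this step concrete, the sketch below shows one way the least-squares solve can be set up, assuming the classical instantaneous motion-field model in normalized image coordinates and a flow field accumulated over the exposure window; the function name and exact parameterization are illustrative assumptions rather than our released implementation.

import numpy as np

def velocity_from_flow_and_depth(flow, depth, K, exposure):
    """Recover instantaneous camera velocity from a dense motion flow
    field and a monocular depth map via linear least squares.

    flow:     (H, W, 2) motion flow in pixels accumulated over the exposure
    depth:    (H, W) metric depth per pixel (assumed valid / nonzero)
    K:        (3, 3) camera intrinsics
    exposure: exposure time in seconds
    Returns (v, w): translational (m/s) and angular (rad/s) velocity.
    """
    H, W = depth.shape
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]

    # Normalized image coordinates for every pixel.
    u, v_px = np.meshgrid(np.arange(W), np.arange(H))
    x = (u - cx) / fx
    y = (v_px - cy) / fy
    Z = depth

    # Flow converted to normalized coordinates per unit time.
    xdot = flow[..., 0] / (fx * exposure)
    ydot = flow[..., 1] / (fy * exposure)

    # Instantaneous motion-field model (small-motion assumption):
    #   xdot = (x*tz - tx)/Z + x*y*wx - (1 + x^2)*wy + y*wz
    #   ydot = (y*tz - ty)/Z + (1 + y^2)*wx - x*y*wy - x*wz
    # Stack two rows per pixel into A [t; w] = b and solve in the least-squares sense.
    zeros = np.zeros_like(x)
    A_x = np.stack([-1.0 / Z, zeros, x / Z, x * y, -(1 + x**2), y], axis=-1)
    A_y = np.stack([zeros, -1.0 / Z, y / Z, 1 + y**2, -x * y, -x], axis=-1)
    A = np.concatenate([A_x.reshape(-1, 6), A_y.reshape(-1, 6)], axis=0)
    b = np.concatenate([xdot.ravel(), ydot.ravel()])

    sol, *_ = np.linalg.lstsq(A, b, rcond=None)
    return sol[:3], sol[3:]

In practice, the per-pixel equations could be weighted or subsampled for speed, and the recovered velocity is only defined up to the sign ambiguity resolved by the photometric check described above.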
We conduct all evaluations on footage recorded with an iPhone 13 Pro using the StrayScanner app, which we slightly modify to obtain the exposure time from ARKit.
We evaluate our method on real-world motion-blurred videos. While the baseline methods must use multiple frames to compute the velocity, our network takes only a single frame as input. Because the direction of motion in a single motion-blurred image is ambiguous, we flip the velocity direction as necessary based on the photometric error between frames. We treat the gyroscope readings directly as the angular velocity ground truth, and we approximate the translational velocity ground truth from the ARKit poses and the framerate. Note that the angular velocity axes are x-up, y-left, z-backward (IMU convention), whereas the translational velocity axes are x-right, y-down, z-forward (OpenCV convention).
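As a rough illustration of how such a ground truth can be formed, the sketch below approximates per-frame translational velocity from consecutive ARKit camera-to-world poses by finite differences; the function name, the pose layout, and the axis-remap matrix are assumptions for illustration, not our evaluation code.

import numpy as np

def gt_translational_velocity(poses_c2w, fps):
    """Approximate translational velocity ground truth from ARKit poses.

    poses_c2w: (N, 4, 4) camera-to-world transforms, one per frame
    fps:       video framerate in Hz
    Returns (N-1, 3) velocities expressed in each frame's camera
    coordinates (in whatever axis convention the poses use).
    """
    R = poses_c2w[:, :3, :3]
    t = poses_c2w[:, :3, 3]
    # Finite-difference displacement between consecutive frames, scaled to m/s.
    v_world = (t[1:] - t[:-1]) * fps
    # Rotate each world-frame velocity into the earlier frame's camera frame.
    return np.einsum("nij,nj->ni", R[:-1].transpose(0, 2, 1), v_world)

# Literal reading of the stated axis conventions: a vector in the IMU frame
# (x-up, y-left, z-backward) maps to the OpenCV frame (x-right, y-down,
# z-forward) via this permutation with sign flips.
IMU_TO_CV = np.array([[0.0, -1.0, 0.0],
                      [-1.0, 0.0, 0.0],
                      [0.0, 0.0, -1.0]])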
Our network uses a single frame as input in complete isolation from the rest of the video. Try the slider to see how our method recovers the camera motion at a particular frame!
We show the runtimes of our method against the baselines on one of the real-world videos. All methods are run on an NVIDIA RTX 3090. Even including the direction disambiguation, our method is significantly faster and runs in real time at 30 Hz.
@article{chen2025_imageimu,
  title   = {{Image as an IMU}: Estimating Camera Motion from a Single Motion-Blurred Image},
  author  = {Chen, Jerred and Clark, Ronald},
  journal = {arXiv preprint},
  year    = {2025}
}