ViPE: Video Pose Engine for Geometric 3D Perception

[Figure: ViPE teaser]

TL;DR: ViPE is an open-source spatial AI tool for annotating raw videos with camera poses and dense depth maps.

Contributors: NVIDIA Spatial Intelligence Lab, Dynamic Vision Lab, NVIDIA Isaac, and NVIDIA Research.

ViPE estimates camera intrinsics, camera motion, and dense near-metric depth maps from unconstrained raw videos. It is designed for varied real-world footage, including dynamic selfie videos, cinematic shots, dashcams, wide-angle videos, and 360-degree panoramas.

ViPE was used to annotate a large-scale video collection containing around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames.

What ViPE Produces

Camera poses
Camera intrinsics
Dense depth maps
Optional instance masks
Optional visualization videos
Optional reusable artifacts for downstream tools
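
To make the relationship between these outputs concrete, the sketch below shows how a depth map, camera intrinsics, and a camera pose can be combined into world-space 3D points. It is an illustration only, not ViPE's API or file format: the function names are hypothetical, and it assumes a simple pinhole camera (`fx`, `fy`, `cx`, `cy`), depth measured along the optical axis, and a 4x4 camera-to-world pose matrix. Wide-angle and 360-degree footage would need a different projection model.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def unproject_pixel(u: float, v: float, depth: float,
                    fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) into camera coordinates.

    Assumes a pinhole model and depth measured along the optical (z) axis.
    """
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    return (x, y, depth)


def camera_to_world(point_cam: Vec3,
                    cam_to_world: Sequence[Sequence[float]]) -> Vec3:
    """Apply a 4x4 camera-to-world rigid transform (row-major) to a point."""
    x, y, z = point_cam
    wx, wy, wz = (row[0] * x + row[1] * y + row[2] * z + row[3]
                  for row in cam_to_world[:3])
    return (wx, wy, wz)


def depth_map_to_points(depth_map: Sequence[Sequence[float]],
                        fx: float, fy: float, cx: float, cy: float,
                        cam_to_world: Sequence[Sequence[float]]) -> List[Vec3]:
    """Convert one frame's depth map into a list of world-space points."""
    points: List[Vec3] = []
    for v, row in enumerate(depth_map):
        for u, depth in enumerate(row):
            if depth > 0:  # treat non-positive depth as invalid (assumption)
                p_cam = unproject_pixel(u, v, depth, fx, fy, cx, cy)
                points.append(camera_to_world(p_cam, cam_to_world))
    return points


# Toy example: a 2x2 depth map and a camera shifted 1 unit along world x.
toy_depth = [[2.0, 2.0],
             [2.0, 0.0]]  # the last pixel has no valid depth
toy_pose = [[1.0, 0.0, 0.0, 1.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]
toy_points = depth_map_to_points(toy_depth, fx=1.0, fy=1.0, cx=0.5, cy=0.5,
                                 cam_to_world=toy_pose)
```

Running the same conversion for every frame with its own pose places all points in one shared world frame, which is why per-frame depth is most useful alongside the estimated intrinsics and camera motion.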