ViPE: Video Pose Engine for Geometric 3D Perception

TL;DR: ViPE is an open-source spatial AI tool that annotates raw videos with camera poses and dense depth maps.

Contributors: NVIDIA Spatial Intelligence Lab, Dynamic Vision Lab, NVIDIA Isaac, and NVIDIA Research.

ViPE estimates camera intrinsics, camera motion, and dense near-metric depth maps from unconstrained raw videos. It is designed for varied real-world footage, including dynamic selfie videos, cinematic shots, dashcams, wide-angle videos, and 360-degree panoramas.

ViPE was used to annotate a large-scale video collection containing around 100K real-world internet videos, 1M high-quality AI-generated videos, and 2K panoramic videos, totaling approximately 96M frames.

What ViPE Produces

  • Camera poses
  • Camera intrinsics
  • Dense depth maps
  • Optional instance masks
  • Optional visualization videos
  • Optional reusable artifacts for downstream tools
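
To illustrate how the three core outputs fit together, the sketch below lifts one frame's depth map into world-space 3D points using that frame's intrinsics and pose. It is not ViPE's API or on-disk format. The function name `unproject_depth` is hypothetical, and the sketch assumes a pinhole camera with a 3x3 intrinsics matrix, depth stored as distance along the optical axis, and a 4x4 camera-to-world pose. Wide-angle and 360-degree footage would need a different projection model.

```python
import numpy as np


def unproject_depth(depth, K, cam_to_world):
    """Lift an (H, W) depth map to an (H*W, 3) array of world-space points.

    Assumptions (not taken from ViPE itself):
      - K is a 3x3 pinhole intrinsics matrix [[fx, 0, cx], [0, fy, cy], [0, 0, 1]]
      - depth holds z-depth along the optical axis
      - cam_to_world is a 4x4 rigid camera-to-world transform
    """
    h, w = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    # Pixel grid: u indexes columns, v indexes rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))

    # Back-project each pixel into the camera frame.
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts_cam = np.stack([x, y, depth], axis=-1).reshape(-1, 3)

    # Move the points into the world frame with the camera pose.
    R = cam_to_world[:3, :3]
    t = cam_to_world[:3, 3]
    return pts_cam @ R.T + t
```

Running this per frame and concatenating the results gives a rough point cloud of the scene. Because the depth is near-metric rather than exactly metric, the resulting scale should be treated as approximate.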