We introduce Metric3D v2, a geometric foundation model designed for zero-shot metric depth and surface normal estimation from single images, critical for accurate 3D recovery. Depth and normal estimation, though complementary, present distinct challenges. State-of-the-art monocular depth methods achieve zero-shot generalization through affine-invariant depths, but fail to recover real-world metric scale. Conversely, current normal estimation techniques struggle with zero-shot performance due to insufficient labeled data. We propose targeted solutions for both metric depth and normal estimation. For metric depth, we present a canonical camera space transformation module that resolves metric ambiguity across various camera models and large-scale datasets, which can be easily integrated into existing monocular models. For surface normal estimation, we introduce a joint depth-normal optimization module that leverages diverse data from metric depth, allowing normal estimators to improve beyond traditional labels. Our model, trained on over 16 million images from thousands of camera models with varied annotations, excels in zero-shot generalization to new camera settings. As shown in Fig. 1, It ranks the 1st in multiple zero-shot and standard benchmarks for metric depth and surface normal prediction. Our method enables the accurate recovery of metric 3D structures on randomly collected internet images, paving the way for plausible single-image metrology. Our model also relieves the scale drift issues of monocular-SLAM (Fig. 3), leading to high-quality metric scale dense mapping. Such applications highlight the versatility of Metric3D v2 models as geometric foundation models.
Read full abstract