MetricScenes

Inference with MoGe-2 and WildMoGe

MoGe-2 tends to predict a smaller scale for background structures on in-the-wild scenes, while WildMoGe recovers the correct scale. On ETH3D scenes, they produce similar results, with WildMoGe slightly more accurate. The videos first show MoGe-2's result, then WildMoGe's result.

Abstract

Metric-scale monocular geometry estimation has seen significant progress through large-scale data aggregation, yet current foundation models suffer from a persistent "scale-collapse" phenomenon: distant landmarks and vast landscapes are metrically underestimated. We hypothesize that this performance gap stems from a training data bottleneck, where existing metric-scale datasets are hardware-constrained to homogeneous vehicle-captured LiDAR or short-range indoor scans, or consist of synthetic data that lacks the semantic complexity of the physical world. To bridge this gap, we curate a new metrically grounded, in-the-wild dataset that we call MetricScenes, gathered from a variety of sources including Internet photo collections and stereo imagery. We estimate camera poses and initial depth maps for each scene using off-the-shelf methods, and recover absolute scale from geo-tagged metadata as well as known stereo camera baselines. We also improve the quality of depth maps derived from MetricScenes via a new two-stage Poisson completion method. Fine-tuning MoGe-2 on our dataset significantly mitigates scale-collapse and achieves superior metric accuracy in unconstrained, open-domain scenes while maintaining state-of-the-art performance on standard benchmarks.

The MetricScenes Dataset

To fill critical gaps and add diversity missing in existing real-world metric datasets, we leverage widely available visual sources, including Internet photo collections and stereo imagery. These sources provide the environmental and semantic variety missing from existing hardware-constrained datasets. We reconstruct camera viewpoints and initial depth maps via off-the-shelf methods, then recover absolute physical scale by leveraging geolocated landmark metadata and stereo camera baselines. Specifically, we aggregate data from MegaScenes, AerialMegaDepth, and Stereo4D, and develop pipelines to extract metric-scale depth maps in each case.

The MetricScenes dataset contains 29,583 images from 356 scenes in MegaScenes, 47,579 images from 134 scenes in AerialMegaDepth (real-world images only), and 22,549 frames from 1,725 videos in Stereo4D.

Recovering Metric Scale in Training Data

Metric depth from Internet photo collections. Geo-tagged images obtained from online mapping sites can be used to scale SfM results to absolute metric scale. AerialMegaDepth is reconstructed with pseudo-synthetic views rendered from Google Earth and scenes are scaled accordingly. MegaScenes contains natively unscaled SfM results. We augment these SfM models with georeferenced street-level views to scale the geometry to physical dimensions. After scaling and running MVS, we apply a depth filtering method to remove transient objects (yellow box) and filter out depth-bleeding regions (red box).
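The scaling step can be illustrated with a minimal sketch. It assumes the georeferenced views give camera centers both in SfM coordinates and in a local metric frame (meters), and it estimates the scale as the median ratio of pairwise camera distances. The function name, the inputs, and the median-ratio estimator are illustrative assumptions, not necessarily the procedure used to build MetricScenes.

```python
import math
from itertools import combinations
from statistics import median


def estimate_metric_scale(sfm_centers, metric_centers, min_dist=1e-9):
    """Return the factor that maps SfM units to meters (hypothetical helper).

    sfm_centers: camera centers of the georeferenced views in SfM coordinates.
    metric_centers: the same cameras expressed in a local metric frame (meters).
    Pairwise distances are invariant to rotation and translation, so the
    median distance ratio estimates scale without solving a full alignment.
    """
    if len(sfm_centers) != len(metric_centers) or len(sfm_centers) < 2:
        raise ValueError("need at least two matched camera centers")
    ratios = []
    for i, j in combinations(range(len(sfm_centers)), 2):
        d_sfm = math.dist(sfm_centers[i], sfm_centers[j])
        if d_sfm > min_dist:  # skip (near-)coincident cameras
            d_metric = math.dist(metric_centers[i], metric_centers[j])
            ratios.append(d_metric / d_sfm)
    if not ratios:
        raise ValueError("all camera centers coincide")
    return median(ratios)


# Toy data: the metric frame is the SfM frame scaled by 12.5 and shifted.
sfm = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (1.0, 1.0, 1.0)]
metric = [(12.5 * x + 100.0, 12.5 * y - 40.0, 12.5 * z + 3.0) for x, y, z in sfm]

scale = estimate_metric_scale(sfm, metric)
sfm_depth = 4.0                    # a depth value in SfM units
metric_depth = scale * sfm_depth   # the same depth in meters
```

The median is used here only because it tolerates a few badly georeferenced views; a least-squares similarity alignment would be a reasonable alternative.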

Metric Depth from Stereo4D. Top: Standard stereo matching often produces distorted geometry in poorly calibrated in-the-wild videos, as seen in the converging facades (magenta boxes). Among multi-view models, π³ maintains the most robust geometry and sharp local details (cyan boxes). Bottom: We process stereoscopic sequences via π³ to obtain dense geometry and poses, then compute a global scale factor to align the predicted baseline with the camera's physical baseline. This yields accurately scaled metric depth maps.
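The baseline alignment described above amounts to a single ratio, sketched below. The sketch assumes per-frame left and right camera centers are available from the multi-view model in its own units and that the physical baseline is known in meters. The function name, the median over frames, and the 0.063 m baseline are placeholder assumptions, not values taken from Stereo4D.

```python
import math
from statistics import median


def baseline_scale(left_centers, right_centers, physical_baseline_m):
    """Global factor mapping predicted units to meters (hypothetical helper).

    The predicted baseline is measured per frame as the distance between the
    left and right camera centers; the median over frames gives one global
    estimate for the whole sequence.
    """
    predicted = [math.dist(l, r) for l, r in zip(left_centers, right_centers)]
    if not predicted:
        raise ValueError("no stereo frames given")
    return physical_baseline_m / median(predicted)


# Toy sequence: the rig moves along x and the predicted baseline is 0.1 units.
left = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (1.0, 0.0, 0.0)]
right = [(0.1, 0.0, 0.0), (0.6, 0.0, 0.0), (1.1, 0.0, 0.0)]

scale = baseline_scale(left, right, physical_baseline_m=0.063)  # placeholder baseline
predicted_depth = 20.0                   # depth in predicted units
metric_depth = scale * predicted_depth   # depth in meters
```

Applying one factor to every frame keeps the sequence geometrically consistent, which matches the "global scale factor" wording of the caption.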

Depth Fusion and Completion

Depth maps derived from SfM and MVS lack transient foreground objects. To fill these gaps, naive Poisson completion uses the MVS depth map as a fixed boundary condition, and relies on the gradient guidance of model-predicted depth maps to reconstruct the missing areas.
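A minimal sketch of this naive completion follows, written in plain Python so that it runs without dependencies. It assumes a 4-connected pixel grid and uses Gauss-Seidel iterations; a real pipeline would more likely use a sparse linear solver. The names `poisson_complete`, `mvs`, `known`, and `guide` are illustrative.

```python
def poisson_complete(mvs, known, guide, iters=500):
    """Fill unknown pixels so their gradients match those of `guide`.

    mvs:   2D list of depths, trusted only where known[y][x] is True.
    known: 2D list of booleans marking the fixed boundary condition.
    guide: 2D list of model-predicted depths that supplies the gradients.
    """
    h, w = len(guide), len(guide[0])
    # Start from the MVS anchors and use the guide as the initial guess elsewhere.
    depth = [[mvs[y][x] if known[y][x] else guide[y][x] for x in range(w)]
             for y in range(h)]
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                if known[y][x]:
                    continue  # anchors stay fixed
                acc, n = 0.0, 0
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        # Neighbor depth plus the guide's difference to it.
                        acc += depth[ny][nx] + guide[y][x] - guide[ny][nx]
                        n += 1
                depth[y][x] = acc / n
    return depth


# Toy scene: MVS is known on the image border, the interior is a hole.
truth = [[10.0 + 2.0 * x + 0.5 * y * y for x in range(5)] for y in range(5)]
known = [[y in (0, 4) or x in (0, 4) for x in range(5)] for y in range(5)]
mvs = [[truth[y][x] if known[y][x] else 0.0 for x in range(5)] for y in range(5)]
# The guide has the right gradients but a constant offset in absolute depth.
guide = [[v - 4.0 for v in row] for row in truth]

completed = poisson_complete(mvs, known, guide)
max_err = max(abs(completed[y][x] - truth[y][x])
              for y in range(5) for x in range(5))
```

In this toy case the guide's gradients are exact, so the hole is recovered up to solver tolerance; the result is only as good as the gradients the guide provides.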

Naive Poisson completion often fails to produce accurate depth maps because the guidance depth maps (e.g., those predicted by MoGe-2) suffer from scale-collapse, causing the solver to unnaturally enlarge foreground objects or distort boundaries.

To remedy this, our key insight is to use both background and foreground depths as joint anchors, rather than relying solely on background constraints, while carefully optimizing object edges. We fuse MVS depth maps (background) with MoGe-2 predicted depth maps (foreground) using a two-stage edge-aware Poisson completion algorithm.

Stage 1 performs a Poisson solve using the gradient guidance of MoGe-2's depth map, constrained by the filtered MVS results. This yields a base depth map with an accurate background scale, though the foreground exhibits scale-drift.
In Stage 2, the background isolated from the base depth map (base BG) and the MoGe-2 foreground (MoGe-2 FG) serve as joint anchors for the final completion. We perform an edge-weighted Poisson solve using these anchors, reusing MoGe-2's gradient guidance to produce a globally consistent metric depth map with sharp local details.
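The two stages can be sketched on a toy grid as follows. This is an illustration under stated assumptions, not the authors' implementation: the anchor masks are given by hand, the edge weight is taken to be exp(-|guide difference| / sigma), and a plain-Python Gauss-Seidel loop stands in for a proper sparse solver. All names (`poisson_solve`, `bg_anchor`, `fg_anchor`, `sigma`) are hypothetical.

```python
import math

NEIGHBORS = ((1, 0), (-1, 0), (0, 1), (0, -1))


def poisson_solve(init, fixed, guide, sigma=None, iters=1000):
    """Gradient-guided Poisson solve by Gauss-Seidel iteration.

    Pixels with fixed[y][x] True keep init[y][x]. Every other pixel is solved
    so that its differences to its neighbors follow `guide`. With sigma set,
    links across strong guide edges get a small weight (assumed form), so a
    scale mismatch between anchors is absorbed at object edges.
    """
    h, w = len(guide), len(guide[0])
    depth = [row[:] for row in init]
    for _ in range(iters):
        for y in range(h):
            for x in range(w):
                if fixed[y][x]:
                    continue
                acc = total = 0.0
                for dy, dx in NEIGHBORS:
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < h and 0 <= nx < w):
                        continue
                    diff = guide[y][x] - guide[ny][nx]
                    wt = 1.0 if sigma is None else math.exp(-abs(diff) / sigma)
                    acc += wt * (depth[ny][nx] + diff)
                    total += wt
                depth[y][x] = acc / total
    return depth


H, W = 3, 9
# Guide with scale-collapse: foreground (columns 3-5) at 5 m is right,
# background is predicted at 40 m although MVS measures 100 m.
guide = [[5.0 if 3 <= x <= 5 else 40.0 for x in range(W)] for _ in range(H)]
mvs_valid = [[x in (0, 8) for x in range(W)] for _ in range(H)]

# Stage 1: filtered MVS as the only constraint, uniform weights.
init1 = [[100.0 if mvs_valid[y][x] else guide[y][x] for x in range(W)]
         for y in range(H)]
base = poisson_solve(init1, mvs_valid, guide)

# Stage 2: base background and guide foreground as joint anchors.
bg_anchor = [[x in (0, 1, 7, 8) for x in range(W)] for _ in range(H)]
fg_anchor = [[x == 4 for x in range(W)] for _ in range(H)]
fixed2 = [[bg_anchor[y][x] or fg_anchor[y][x] for x in range(W)] for y in range(H)]
init2 = [[guide[y][x] if fg_anchor[y][x] else base[y][x] for x in range(W)]
         for y in range(H)]
final = poisson_solve(init2, fixed2, guide, sigma=1.0)
```

In this toy setup Stage 1 places the foreground at about 65 m, because the guide's collapsed background shifts everything by a constant, while Stage 2 restores it to 5 m and keeps the background at 100 m with a sharp transition between columns 2 and 3.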

WildMoGe

WildMoGe is a MoGe-2 ViT-Large-Normal model fine-tuned on our MetricScenes dataset.

Metrology of novel in-the-wild scenes. The first column shows images with measurements obtained via Google Maps' measuring tool. We merge WildMoGe's and MoGe-2's results into a single column to highlight the accurate scaling achieved by our training scheme. WildMoGe consistently recovers more accurate absolute scales across diverse landmarks, whereas MoGe-2, DepthAnything v3 and Metric3D v2 exhibit scale-collapse, underestimating the scale of background structures. While UniDepth v2 produces more realistic scales, its estimates still deviate from the ground truth. DepthPro often produces scales orders of magnitude smaller than reality.

Comparison on standard scenes. We compare WildMoGe against MoGe-2 on representative indoor and street-level scenes. In standard indoor and street contexts (Rows 1 & 2), WildMoGe provides scale estimates consistent with MoGe-2. On the ETH3D courtyard scene (Row 3), WildMoGe achieves better accuracy, recovering a desk leg height of 71.6 cm compared to the 72 cm ground truth. This suggests that WildMoGe's performance is driven by precise metric grounding rather than a bias toward larger scales.
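For reference, the desk leg measurement corresponds to a relative error of roughly 0.56%. The short helper below is illustrative only and is not an evaluation metric defined by the authors.

```python
def relative_error(predicted, ground_truth):
    """Absolute error as a fraction of the ground-truth length."""
    return abs(predicted - ground_truth) / ground_truth


# Desk leg height from the ETH3D example: 71.6 cm predicted, 72 cm ground truth.
desk_leg_error = relative_error(71.6, 72.0)
desk_leg_error_percent = 100.0 * desk_leg_error  # about 0.56
```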

Quantitative Evaluations

We evaluate on our curated test set and the standard benchmarks to demonstrate that our model achieves performance comparable to the state-of-the-art in specialized environments while effectively bridging the gap between hardware-constrained training data and unconstrained, in-the-wild scenarios.

Quantitative evaluation of relative and metric geometry. The top section evaluates on the standard benchmarks, while the bottom section evaluates on the MetricScenes test set. Metrics are color-coded: red (best) and yellow (second best).

Acknowledgment

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korean Government (MSIT) (No. RS-2024-00457882, National AI Research Lab Project), and the Major Program of the National Natural Science Foundation of China (Grant No. 62595772).

Honey, I Shrunk the Arc de Triomphe!