Depth Maps in Computer Vision: From Stereo Geometry to Neural Networks
This post explores how computers perceive depth in images, bridging classical geometric methods with modern neural approaches. Familiarity with basic linear algebra, projective geometry, and convolutional neural networks is helpful but not required.
Related Posts:
- Image Matting: Estimating Accurate Mask Edges - Complementary segmentation technique for foreground extraction
- Video Matting: Temporal Consistency - Temporal processing for video, similar to depth sequence processing
- Matrix Determinants and Linear Algebra - Mathematical foundations used in epipolar geometry
Table of Contents
- Introduction: The Challenge of Depth Perception
- What is a Depth Map?
- Classical Approaches to Depth Estimation
- The Mathematics of Stereo Vision
- Stereo Matching Algorithms
- Deep Learning for Depth Estimation
- Applications of Depth Maps
- Challenges and Future Directions
- Key Takeaways
- Further Reading
Introduction: The Challenge of Depth Perception
One of the most fundamental challenges in computer vision is understanding the three-dimensional structure of the world from two-dimensional images. Humans effortlessly perceive depth through binocular vision, motion parallax, and learned visual cues. But how can we enable machines to do the same?
Depth maps are the primary representation computers use to encode 3D structure. A depth map is simply an image where each pixel value represents the distance from the camera to the corresponding point in the scene. This seemingly simple representation enables a vast array of applications: autonomous navigation, augmented reality, 3D reconstruction, robot manipulation, and computational photography.
The journey from 2D images to 3D depth has evolved from purely geometric approaches based on triangulation and multi-view geometry to modern neural networks that can estimate depth from a single image by learning statistical priors about the 3D world.
What is a Depth Map?
A depth map (also called a depth image or range image) is a 2D representation where each pixel encodes the distance from the camera sensor to the corresponding surface in the scene.
Mathematically, for an image with dimensions $W \times H$, the depth map is a function:
\[D: \{1, \ldots, W\} \times \{1, \ldots, H\} \rightarrow \mathbb{R}^+\]where $D(u, v)$ represents the depth (distance) at pixel coordinates $(u, v)$.
Representation Formats
Depth maps can be represented in several ways:
- Metric Depth: Absolute distance in physical units (meters, millimeters)
- Most accurate and useful for robotics and measurement
- Requires calibration
- Inverse Depth ($1/Z$):
- More uniformly distributed for scenes with varying depth ranges
- Linearizes depth with respect to disparity
- Common in structure-from-motion pipelines
- Normalized Depth: Values scaled to $[0, 1]$ or $[0, 255])
- Useful for visualization
- Loses absolute scale information
- Disparity: Pixel displacement between stereo image pairs
- Directly computed from stereo matching
- Inversely proportional to depth
Depth vs Disparity
In stereo vision, we often work with disparity rather than depth directly. For a calibrated stereo rig with baseline $b$ and focal length $f$, the relationship is:
\[Z = \frac{b \cdot f}{d}\]where:
- $Z$ is the depth (distance to the scene point)
- $d$ is the disparity (pixel difference between left and right images)
- $b$ is the baseline (distance between camera centers)
- $f$ is the focal length
This inverse relationship means:
- Large disparity → close objects (small $Z$)
- Small disparity → far objects (large $Z$)
Classical Approaches to Depth Estimation
Before the deep learning revolution, depth estimation relied primarily on geometric principles and active sensing techniques.
Stereo Vision
Stereo vision mimics human binocular vision by using two cameras separated by a baseline. The key insight: a point in 3D space projects to different pixel locations in each camera. By finding these corresponding points (solving the correspondence problem), we can triangulate the 3D position.
Process:
- Calibrate both cameras (intrinsic and extrinsic parameters)
- Rectify images so epipolar lines are horizontal
- For each pixel in the left image, find the corresponding pixel in the right image
- Compute disparity: $d = u_L - u_R)
- Convert disparity to depth: $Z = \frac{bf}{d})
Advantages:
- Passive sensing (no special lighting required)
- Dense depth maps possible
- Works at various scales
Challenges:
- Textureless regions lack features for matching
- Occlusions create ambiguities
- Repetitive patterns cause false matches
- Requires accurate calibration
Structured Light
Structured light systems actively project known patterns onto the scene and observe their deformation to infer depth.
Common Patterns:
- Laser lines: Single line scanned across the scene
- Random dots: Pseudo-random pattern (used in Kinect v1, iPhone Face ID)
- Binary codes: Temporal coding schemes using multiple projected patterns
- Phase shifting: Sinusoidal patterns with varying phase
Example: Kinect v1 uses an infrared laser projector to cast a pseudo-random dot pattern. By comparing the observed pattern to a reference pattern, the system computes disparity and hence depth.
Advantages:
- High accuracy
- Works on textureless surfaces
- Real-time capable
Challenges:
- Fails in bright sunlight (IR saturation)
- Multiple sensors interfere with each other
- Limited to indoor/controlled environments
Time-of-Flight Sensors
Time-of-Flight (ToF) sensors measure the time it takes for emitted light to return after reflecting off surfaces.
\[Z = \frac{c \cdot \Delta t}{2}\]where:
- $c$ is the speed of light
- $\Delta t$ is the round-trip time
Two Variants:
- Pulsed ToF: Measures actual flight time of light pulses
- Continuous-wave ToF: Measures phase shift of modulated light
Examples: Microsoft Kinect v2, Intel RealSense, smartphone LiDAR sensors
Advantages:
- Fast acquisition
- Works on textureless surfaces
- Simple computation
Challenges:
- Lower resolution than cameras
- Multi-path interference
- Limited range
Photometric Stereo
Photometric stereo estimates surface normals (and hence depth) by capturing multiple images under different lighting directions.
Given $n$ images ${I_1, \ldots, I_n}$ with known light directions ${\mathbf{s}_1, \ldots, \mathbf{s}_n}$, and assuming Lambertian reflectance:
\[I_i = \rho \mathbf{n} \cdot \mathbf{s}_i\]where:
- $\rho$ is albedo (surface reflectance)
- $\mathbf{n}$ is the surface normal
With $n \geq 3$ lights, we can solve for $\mathbf{n}$ and $\rho$, then integrate normals to obtain depth.
The Mathematics of Stereo Vision
Stereo vision is the most widely used passive depth estimation method. Understanding its mathematical foundations is crucial.
Epipolar Geometry
Epipolar geometry describes the geometric relationship between two views of a scene.
Key Concepts:
- Epipole ($\mathbf{e}$): Projection of one camera center into the other camera
- Epipolar plane: Plane containing both camera centers and a 3D point
- Epipolar line: Intersection of epipolar plane with image plane
The Epipolar Constraint: For a point $\mathbf{x}_L$ in the left image, its correspondence $\mathbf{x}_R$ in the right image must lie on the corresponding epipolar line.
This is encoded by the fundamental matrix $\mathbf{F}$:
\[\mathbf{x}_R^T \mathbf{F} \mathbf{x}_L = 0\]For calibrated cameras, we use the essential matrix $\mathbf{E}$:
\[\mathbf{x}_R^T \mathbf{E} \mathbf{x}_L = 0\]where $\mathbf{E} = \mathbf{K}_R^T \mathbf{F} \mathbf{K}_L$ and $\mathbf{K}$ are camera intrinsic matrices.
The Disparity-Depth Relationship
Consider a calibrated stereo rig with parallel optical axes (rectified configuration):
Left camera projects 3D point $(X, Y, Z)$ to:
\[u_L = f \frac{X}{Z}, \quad v_L = f \frac{Y}{Z}\]Right camera (shifted by baseline $b$ along X-axis) projects to:
\[u_R = f \frac{X - b}{Z}, \quad v_R = f \frac{Y}{Z}\]Disparity is defined as:
\[d = u_L - u_R = f \frac{X}{Z} - f \frac{X - b}{Z} = \frac{fb}{Z}\]Therefore:
\[Z = \frac{fb}{d}\]Key Observations:
- Depth $Z$ is inversely proportional to disparity $d)
- Larger baseline $b$ → larger disparity → better depth resolution
- Depth uncertainty grows quadratically with distance: $\sigma_Z \propto Z^2)
🎮 Interactive Exploration: Want to build intuition for this relationship? Check out our Interactive Depth vs Disparity Visualization where you can:
- Adjust baseline, focal length, and disparity in real-time
- See 3D stereo camera geometry from multiple viewpoints
- Explore the hyperbolic depth-disparity curve
- Understand why depth uncertainty grows with distance
- Experiment with different stereo camera configurations
Stereo Rectification
Rectification transforms stereo images so that:
- Epipolar lines are horizontal and parallel to image rows
- Corresponding points have the same $v$ coordinate
- Correspondence search reduces to 1D horizontal search
Algorithm:
- Compute fundamental matrix $\mathbf{F}$ from point correspondences
- Compute rectification homographies $\mathbf{H}_L, \mathbf{H}_R)
- Warp both images: $I’_L = \mathbf{H}_L(I_L)$, $I’_R = \mathbf{H}_R(I_R))
After rectification, for pixel $(u_L, v)$ in left image, the correspondence is at $(u_R, v)$ in right image where $u_R < u_L$.
Correspondence Problem
The correspondence problem is finding which pixel in the right image corresponds to a given pixel in the left image.
Challenges:
- Ambiguity: Multiple similar-looking regions
- Occlusions: Point visible in one view but not the other
- Textureless regions: No distinctive features
- Specular reflections: Appearance changes between views
- Repetitive patterns: Many plausible matches
Cost Functions measure similarity between patches:
- Sum of Absolute Differences (SAD): $\sum \lvert I_L(u+i, v+j) - I_R(u-d+i, v+j) \rvert)
- Sum of Squared Differences (SSD): $\sum (I_L(u+i, v+j) - I_R(u-d+i, v+j))^2)
- Normalized Cross-Correlation (NCC): More robust to illumination changes
- Census Transform: Encodes local structure, robust to illumination
- Mutual Information: Information-theoretic measure
Stereo Matching Algorithms
Stereo matching algorithms can be categorized by how they optimize the matching cost.
Local Methods
Local methods compute disparity independently for each pixel based on a local window.
Winner-Takes-All (WTA):
\[d^*(u, v) = \arg\min_{d \in [d_{\min}, d_{\max}]} C(u, v, d)\]where $C(u, v, d)$ is the matching cost for pixel $(u, v)$ at disparity $d$.
Advantages:
- Fast computation (parallelizable)
- Simple implementation
- Low memory requirements
Disadvantages:
- Noisy in low-texture regions
- No global consistency
- Window size trade-off: small → noisy, large → blurred boundaries
Global Methods
Global methods formulate stereo matching as an energy minimization problem:
\[E(D) = E_{\text{data}}(D) + \lambda E_{\text{smooth}}(D)\]where:
- $E_{\text{data}}(D) = \sum_{(u,v)} C(u, v, D(u, v))$: Data term (matching cost)
- $E_{\text{smooth}}(D) = \sum_{(u,v) \sim (u’,v’)} V(D(u, v), D(u’, v’))$: Smoothness term
- $\lambda$: Regularization weight
Common Smoothness Terms:
- L1: $V(d_p, d_q) = \lvert d_p - d_q \rvert$ (preserves discontinuities)
- L2: $V(d_p, d_q) = (d_p - d_q)^2$ (smooth but blurs edges)
- Potts model: $V(d_p, d_q) = [d_p \neq d_q]$ (piecewise constant)
- Truncated linear: $V(d_p, d_q) = \min(\lvert d_p - d_q \rvert, \tau))
Optimization Methods:
- Graph cuts: Efficiently solves certain energy functions
- Belief propagation: Message-passing on graphical models
- Dynamic programming: Scanline optimization (ignores vertical coherence)
- Variational methods: Continuous optimization with PDEs
Advantages:
- Smoother, more coherent depth maps
- Better handles occlusions
- Enforces geometric constraints
Disadvantages:
- Computationally expensive
- May over-smooth depth discontinuities
Semi-Global Matching
Semi-Global Matching (SGM) by Hirschmüller (2008) balances accuracy and efficiency.
Key Idea: Approximate global optimization by aggregating costs along multiple 1D paths.
Algorithm:
- Compute pixel-wise matching cost $C(u, v, d))
- For each pixel and each of $r$ directions (typically 8 or 16): \(L_r(u, v, d) = C(u, v, d) + \min \begin{cases} L_r(u-r_u, v-r_v, d) \\ L_r(u-r_u, v-r_v, d \pm 1) + P_1 \\ \min_i L_r(u-r_u, v-r_v, i) + P_2 \end{cases}\)
- Aggregate costs: $S(u, v, d) = \sum_r L_r(u, v, d))
- WTA: $d^*(u, v) = \arg\min_d S(u, v, d))
Penalty Terms:
- $P_1$: Small penalty for ±1 disparity change
- $P_2$: Large penalty for large disparity changes
Advantages:
- Near real-time on modern CPUs
- High accuracy (used in autonomous vehicles)
- Handles textureless regions well
Deep Learning for Depth Estimation
The deep learning revolution has transformed depth estimation, enabling previously impossible capabilities.
Monocular Depth Estimation
Monocular depth estimation predicts depth from a single image—a task impossible with classical geometry alone. Deep networks learn depth cues from data:
Geometric Cues:
- Perspective (parallel lines converge)
- Occlusion (foreground blocks background)
- Relative size (known object sizes)
Photometric Cues:
- Atmospheric perspective (distant objects are hazy)
- Defocus blur (depth-of-field effects)
- Shading and shadows
Early Architectures (2014-2016):
Eigen et al. (2014) pioneered end-to-end CNN depth estimation:
- Multi-scale architecture: coarse + fine networks
- Loss functions: depth, gradient, and normal losses
Network Structure:
Input RGB → CNN Encoder → Multi-Scale Decoder → Depth Map
Modern Architectures (2017-present):
Encoder-Decoder with Skip Connections:
- Encoder: ResNet, EfficientNet, or Swin Transformer
- Decoder: Progressive upsampling with skip connections
- Similar to semantic segmentation networks (U-Net style)
Example: DPT (Dense Prediction Transformer, 2021):
Input → Vision Transformer Encoder → Reassemble → Convolutional Decoder → Depth
Loss Functions:
-
Scale-Invariant Loss (Eigen et al.): \(\mathcal{L}_{\text{si}} = \frac{1}{n}\sum_i d_i^2 - \frac{\lambda}{n^2}\left(\sum_i d_i\right)^2\) where $d_i = \log \hat{Z}_i - \log Z_i$ and $\lambda = 0.5)
-
Gradient Loss (preserves depth boundaries): \(\mathcal{L}_{\text{grad}} = \sum_i \lvert \nabla_x \hat{D}_i - \nabla_x D_i \rvert + \lvert \nabla_y \hat{D}_i - \nabla_y D_i \rvert\)
-
Multi-Scale Loss (captures both local and global structure): \(\mathcal{L}_{\text{ms}} = \sum_{s=1}^{4} \alpha_s \mathcal{L}(\hat{D}^s, D^s)\)
Self-Supervised Learning
Obtaining ground truth depth at scale is expensive. Self-supervised methods leverage geometry and photometric consistency.
Monodepth2 (Godard et al., 2019):
Training Setup:
- Input: Stereo image pairs or video sequences
- Predict: Depth map $D$ and ego-motion $T$ (for video)
- No ground truth depth required!
Photometric Loss:
The key insight: if depth is correct, we can warp one view to another and it should match.
Given left image $I_L$, predicted depth $D_L$, we can synthesize right image:
\[\hat{I}_R(u, v) = I_L\left(u + \frac{bf}{D_L(u, v)}, v\right)\]Loss function:
\[\mathcal{L}_{\text{photo}} = \sum_{u,v} \rho\left(I_R(u, v), \hat{I}_R(u, v)\right)\]where $\rho$ is a robust photometric loss (often SSIM + L1).
Additional Components:
- Minimum reprojection loss: Handles occlusions
- Auto-masking: Ignores static camera frames
- Multi-scale prediction: Improves boundary accuracy
Advantages:
- No LiDAR or depth sensors needed
- Scalable to massive datasets (YouTube videos!)
- Learns realistic depth priors
Challenges:
- Scale ambiguity (depth only up to unknown scale)
- Struggles with non-Lambertian surfaces
- Occlusions and dynamic objects cause issues
Multi-View Stereo Networks
Multi-View Stereo (MVS) networks leverage multiple viewpoints with known camera poses.
MVSNet (Yao et al., 2018):
Key Innovation: Build a 3D cost volume in a learned feature space.
Algorithm:
- Extract 2D features from each view using CNN: ${F_1, \ldots, F_N})
- For each reference view and depth hypothesis $d$:
- Warp features from other views to reference view at depth $d)
- Compute similarity (variance or learned metric)
- Build 3D cost volume $C(u, v, d))
- Regularize cost volume with 3D CNN
- Regress depth with soft argmin: \(\hat{d}(u, v) = \sum_d d \cdot \text{softmax}(C(u, v, d))\)
Advantages:
- Explicitly encodes multi-view geometry
- Handles wide baselines
- State-of-the-art reconstruction quality
Applications:
- 3D scanning
- Novel view synthesis (Neural Radiance Fields)
- Autonomous driving
Transformer-Based Approaches
Recent work leverages Vision Transformers for depth estimation.
Key Advantages:
- Global receptive field (vs. local CNNs)
- Better long-range dependencies
- State-of-the-art performance
DPT (Dense Prediction Transformer, 2021):
- Pre-trained Vision Transformer (ViT) as encoder
- Convolutional decoder with skip connections
- Achieves best results on multiple benchmarks
Adabins (2021):
- Adaptive binning for depth range
- Transformer module for range attention
- Handles both indoor and outdoor scenes
MIM (Masked Image Modeling) Pre-training:
- Pre-train on masked image reconstruction (like BERT)
- Fine-tune for depth estimation
- Improves data efficiency
Applications of Depth Maps
Depth maps enable a vast array of applications across computer vision and robotics.
1. Autonomous Driving
Depth Estimation is critical for:
- Obstacle detection: Identify pedestrians, vehicles, barriers
- Path planning: Compute drivable space
- 3D object detection: Estimate size and position of objects
- Scene understanding: Semantic segmentation + depth
Examples: Tesla Autopilot, Waymo, Cruise use multi-camera depth estimation
2. Augmented Reality (AR)
Depth Maps enable realistic AR:
- Occlusion handling: Virtual objects appear behind real objects
- Collision detection: Virtual objects interact with real surfaces
- Spatial understanding: Place objects on floors, walls
- Hand tracking: Depth improves gesture recognition
Examples: Apple ARKit, Google ARCore, Meta Quest
3. 3D Reconstruction
Structure from Motion (SfM) and SLAM use depth:
- 3D scanning: Reconstruct objects and environments
- Photogrammetry: Create 3D models from images
- Cultural heritage: Digitize artifacts and monuments
- VFX and gaming: Create realistic 3D assets
4. Robotics
Depth Perception is essential for:
- Grasping: Estimate object pose and shape
- Navigation: Build occupancy maps
- Manipulation: Plan collision-free paths
- Human-robot interaction: Maintain safe distances
5. Computational Photography
Depth-Based Effects:
- Portrait mode: Realistic bokeh (background blur)
- Refocusing: Change focus after capture
- 3D photos: Parallax effects (Facebook 3D Photos)
- Lighting adjustment: Relight scenes based on geometry
6. Medical Imaging
Depth Reconstruction in:
- Endoscopy: 3D structure of internal organs
- Surgery planning: Anatomical modeling
- Prosthetics: Custom-fit devices
Challenges and Future Directions
Despite remarkable progress, depth estimation still faces significant challenges.
Current Limitations
- Transparent and Reflective Surfaces
- Glass, mirrors, water violate standard assumptions
- Specular reflections confuse stereo matching
- Recent work: physics-based models, polarization
- Textureless Regions
- Large uniform areas (walls, sky) lack features
- CNNs help by learning priors, but still struggle
- Structured light and ToF work better
- Scale Ambiguity in Monocular Depth
- Single-image depth is only up to scale
- Need absolute metric depth for robotics
- Solutions: multi-task learning with scale supervision
- Generalization Across Domains
- Models trained on outdoor scenes fail indoors
- Different camera parameters require retraining
- Domain adaptation and meta-learning help
- Real-Time Performance
- High-resolution depth at video rates is demanding
- Mobile deployment requires model compression
- Specialized hardware (neural accelerators) emerging
Emerging Directions
1. Neural Radiance Fields (NeRF) and 3D Gaussians
- Represent scenes as continuous functions
- Jointly optimize geometry and appearance
- Enable photorealistic novel view synthesis
2. Depth Completion
- Fuse sparse LiDAR with dense RGB
- Best of both worlds: accurate + dense
- Critical for autonomous driving
3. Learned Multi-View Stereo
- Replace hand-crafted matching costs with learned features
- Adaptive cost aggregation
- Hybrid differentiable optimization
4. Event Cameras for Depth
- Neuromorphic sensors with microsecond latency
- Ideal for high-speed motion
- Research frontier: event-based stereo
5. Depth from Defocus
- Use lens optics and focus variation
- Single-camera solution
- Emerging in smartphone photography
6. Foundation Models for 3D
- Pre-train on massive 3D datasets
- Zero-shot depth estimation
- Universal scene understanding (Depth Anything, ZoeDepth)
Key Takeaways
-
Depth maps encode 3D structure as 2D images where pixel values represent distance from the camera—fundamental for understanding scene geometry.
- Classical methods rely on geometric principles:
- Stereo vision triangulates depth from two viewpoints
- Structured light and ToF actively sense distance
- Accuracy depends on calibration and scene properties
-
Stereo matching is an optimization problem balancing data fidelity and smoothness, with Semi-Global Matching providing an excellent speed-accuracy trade-off.
- Deep learning revolutionized depth estimation:
- Monocular networks predict depth from single images by learning 3D priors
- Self-supervised methods train without ground truth depth
- Transformers achieve state-of-the-art by capturing global context
-
Applications span autonomous driving, AR, robotics, computational photography, and 3D reconstruction—depth is foundational to spatial computing.
- Future directions include neural 3D representations (NeRF), foundation models for universal depth estimation, and hybrid approaches combining geometric constraints with learned priors.
Depth estimation bridges classical computer vision geometry with modern deep learning, demonstrating how first principles and data-driven methods can complement each other to solve fundamental perception problems.
Further Reading
Seminal Papers
Classical Stereo:
- Scharstein & Szeliski (2002): “A Taxonomy and Evaluation of Dense Two-Frame Stereo Correspondence Algorithms”
- Hirschmüller (2008): “Stereo Processing by Semiglobal Matching and Mutual Information”
Deep Monocular Depth:
- Eigen et al. (2014): “Depth Map Prediction from a Single Image using a Multi-Scale Deep Network”
- Godard et al. (2019): “Digging Into Self-Supervised Monocular Depth Estimation”
Multi-View Stereo:
- Yao et al. (2018): “MVSNet: Depth Inference for Unstructured Multi-view Stereo”
- Schönberger & Frahm (2016): “Structure-from-Motion Revisited” (COLMAP)
Transformers for Depth:
- Ranftl et al. (2021): “Vision Transformers for Dense Prediction”
- Bhat et al. (2021): “AdaBins: Depth Estimation Using Adaptive Bins”
Textbooks
- Hartley & Zisserman: Multiple View Geometry in Computer Vision (2004)
- Szeliski: Computer Vision: Algorithms and Applications (2022)
- Forsyth & Ponce: Computer Vision: A Modern Approach (2012)
Datasets and Benchmarks
- KITTI: Autonomous driving benchmark with LiDAR ground truth
- NYU Depth V2: Indoor scenes with Kinect depth
- SceneNet: Synthetic indoor scenes
- Middlebury Stereo: High-quality stereo pairs with ground truth
Online Resources
What applications do you envision for depth estimation? How might future hardware and algorithms change the way we perceive and interact with 3D space?