Depth Maps in Computer Vision: From Stereo Geometry to Neural Networks

January 15, 2026· computer-vision, depth-estimation, stereo-vision, 3d-reconstruction, deep-learning, neural-networks

This post explores how computers perceive depth in images, bridging classical geometric methods with modern neural approaches. Familiarity with basic linear algebra, projective geometry, and convolutional neural networks is helpful but not required.

Related Posts:

Image Matting: Estimating Accurate Mask Edges - Complementary segmentation technique for foreground extraction
Video Matting: Temporal Consistency - Temporal processing for video, similar to depth sequence processing
Matrix Determinants and Linear Algebra - Mathematical foundations used in epipolar geometry

Introduction: The Challenge of Depth Perception
What is a Depth Map?
- Representation Formats
- Depth vs Disparity
Classical Approaches to Depth Estimation
The Mathematics of Stereo Vision
Stereo Matching Algorithms
Deep Learning for Depth Estimation
Applications of Depth Maps
Challenges and Future Directions
Key Takeaways
Further Reading

Introduction: The Challenge of Depth Perception

One of the most fundamental challenges in computer vision is understanding the three-dimensional structure of the world from two-dimensional images. Humans effortlessly perceive depth through binocular vision, motion parallax, and learned visual cues. But how can we enable machines to do the same?

Depth maps are the primary representation computers use to encode 3D structure. A depth map is simply an image where each pixel value represents the distance from the camera to the corresponding point in the scene. This seemingly simple representation enables a vast array of applications: autonomous navigation, augmented reality, 3D reconstruction, robot manipulation, and computational photography.

The journey from 2D images to 3D depth has evolved from purely geometric approaches based on triangulation and multi-view geometry to modern neural networks that can estimate depth from a single image by learning statistical priors about the 3D world.

What is a Depth Map?

A depth map (also called a depth image or range image) is a 2D representation where each pixel encodes the distance from the camera sensor to the corresponding surface in the scene.

Mathematically, for an image with dimensions $W \times H$, the depth map is a function:

\[D: \{1, \ldots, W\} \times \{1, \ldots, H\} \rightarrow \mathbb{R}^+\]

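where $D(u, v)$ represents the depth (distance) at pixel coordinates $(u, v)$. In the stereo formulas that follow, depth means the $Z$ coordinate along the camera's optical axis, which for off-axis pixels is slightly smaller than the straight-line range to the point.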

Representation Formats

Depth maps can be represented in several ways:

Metric Depth: Absolute distance in physical units (meters, millimeters)
- Most accurate and useful for robotics and measurement
- Requires calibration
Inverse Depth ($1/Z$):
- More uniformly distributed for scenes with varying depth ranges
- Linearizes depth with respect to disparity
- Common in structure-from-motion pipelines
Normalized Depth: Values scaled to $[0, 1]$ or $[0, 255]$
- Useful for visualization
- Loses absolute scale information
Disparity: Pixel displacement between stereo image pairs
- Directly computed from stereo matching
- Inversely proportional to depth
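The formats above are simple transformations of one another. The following is a minimal sketch of the conversions, assuming a strictly positive metric depth array; the function names and the sample values are illustrative, not part of any particular library.

```python
import numpy as np

def to_inverse_depth(depth_m):
    """Inverse depth 1/Z; assumes every depth value is strictly positive."""
    return 1.0 / depth_m

def to_normalized(depth_m):
    """Min-max scale to [0, 1]; absolute scale is discarded.
    Assumes the map is not constant (otherwise the range is zero)."""
    z_min, z_max = depth_m.min(), depth_m.max()
    return (depth_m - z_min) / (z_max - z_min)

# Hypothetical 2x2 metric depth map in meters.
depth = np.array([[1.0, 2.0], [4.0, 8.0]])

inv = to_inverse_depth(depth)                 # large values mean close surfaces
norm = to_normalized(depth)                   # 0 = nearest, 1 = farthest
vis8 = np.round(norm * 255).astype(np.uint8)  # 8-bit image for visualization
```

Note that `to_normalized(depth)` and `to_normalized(3 * depth)` give the same result, which is exactly the loss of absolute scale mentioned in the list.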

Depth vs Disparity

In stereo vision, we often work with disparity rather than depth directly. For a calibrated stereo rig with baseline $b$ and focal length $f$, the relationship is:

\[Z = \frac{b \cdot f}{d}\]

where:

$Z$ is the depth (distance to the scene point)
$d$ is the disparity (pixel difference between left and right images)
$b$ is the baseline (distance between camera centers)
$f$ is the focal length

This inverse relationship means:

Large disparity → close objects (small $Z$)
Small disparity → far objects (large $Z$)
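As a small worked example of $Z = bf/d$, the sketch below converts a few disparities to depth. The focal length (700 px) and baseline (0.12 m) are assumed values chosen for round numbers, and mapping zero disparity to infinite depth is one possible convention.

```python
import numpy as np

def disparity_to_depth(disparity_px, focal_px, baseline_m):
    """Apply Z = b * f / d elementwise.
    Non-positive disparities are mapped to infinity (no valid match or a point at infinity)."""
    disparity_px = np.asarray(disparity_px, dtype=np.float64)
    depth = np.full(disparity_px.shape, np.inf)
    valid = disparity_px > 0
    depth[valid] = focal_px * baseline_m / disparity_px[valid]
    return depth

focal_px = 700.0     # assumed focal length in pixels
baseline_m = 0.12    # assumed baseline in meters

disp = np.array([84.0, 42.0, 8.4, 0.0])
depth = disparity_to_depth(disp, focal_px, baseline_m)
# Halving the disparity doubles the depth: roughly 1 m, 2 m, 10 m, then infinity.
```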

Classical Approaches to Depth Estimation

Before the deep learning revolution, depth estimation relied primarily on geometric principles and active sensing techniques.

Stereo Vision

Stereo vision mimics human binocular vision by using two cameras separated by a baseline. The key insight: a point in 3D space projects to different pixel locations in each camera. By finding these corresponding points (solving the correspondence problem), we can triangulate the 3D position.

Process:

Calibrate both cameras (intrinsic and extrinsic parameters)
Rectify images so epipolar lines are horizontal
For each pixel in the left image, find the corresponding pixel in the right image
Compute disparity: $d = u_L - u_R$
Convert disparity to depth: $Z = \frac{bf}{d}$

Advantages:

Passive sensing (no special lighting required)
Dense depth maps possible
Works at various scales

Challenges:

Textureless regions lack features for matching
Occlusions create ambiguities
Repetitive patterns cause false matches
Requires accurate calibration

Structured Light

Structured light systems actively project known patterns onto the scene and observe their deformation to infer depth.

Common Patterns:

Laser lines: Single line scanned across the scene
Random dots: Pseudo-random pattern (used in Kinect v1, iPhone Face ID)
Binary codes: Temporal coding schemes using multiple projected patterns
Phase shifting: Sinusoidal patterns with varying phase

Example: Kinect v1 uses an infrared laser projector to cast a pseudo-random dot pattern. By comparing the observed pattern to a reference pattern, the system computes disparity and hence depth.

Advantages:

High accuracy
Works on textureless surfaces
Real-time capable

Challenges:

Fails in bright sunlight (IR saturation)
Multiple sensors interfere with each other
Limited to indoor/controlled environments

Time-of-Flight Sensors

Time-of-Flight (ToF) sensors measure the time it takes for emitted light to return after reflecting off surfaces.

\[Z = \frac{c \cdot \Delta t}{2}\]

where:

$c$ is the speed of light
$\Delta t$ is the round-trip time

Two Variants:

Pulsed ToF: Measures actual flight time of light pulses
Continuous-wave ToF: Measures phase shift of modulated light
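Both variants reduce to short formulas. The sketch below evaluates the pulsed relation $Z = c \, \Delta t / 2$ from above, together with the standard continuous-wave relation $Z = c \, \varphi / (4 \pi f_{\text{mod}})$, which is not stated in this post and is included here as an assumption about how phase maps to distance. The modulation frequency in the example is an illustrative value.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def pulsed_tof_depth(round_trip_s):
    """Pulsed ToF: Z = c * dt / 2 (light travels to the surface and back)."""
    return C * round_trip_s / 2.0

def cw_tof_depth(phase_rad, mod_freq_hz):
    """Continuous-wave ToF: Z = c * phi / (4 * pi * f_mod)."""
    return C * phase_rad / (4.0 * math.pi * mod_freq_hz)

def cw_unambiguous_range(mod_freq_hz):
    """Distance at which the phase wraps around (phi = 2 * pi)."""
    return C / (2.0 * mod_freq_hz)

z_pulsed = pulsed_tof_depth(10e-9)        # a 10 ns round trip is about 1.5 m
z_cw = cw_tof_depth(math.pi / 2, 20e6)    # quarter-cycle phase shift at 20 MHz
z_max = cw_unambiguous_range(20e6)        # about 7.5 m at 20 MHz
```

The wrap-around distance is one reason continuous-wave sensors have limited range: a lower modulation frequency extends the range but reduces depth precision.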

Examples: Microsoft Kinect v2, Intel RealSense L515 (most other RealSense models use active stereo instead), smartphone LiDAR sensors

Advantages:

Fast acquisition
Works on textureless surfaces
Simple computation

Challenges:

Lower resolution than cameras
Multi-path interference
Limited range

Photometric Stereo

Photometric stereo estimates surface normals (and hence depth) by capturing multiple images under different lighting directions.

Given $n$ images $\{I_1, \ldots, I_n\}$ with known light directions $\{\mathbf{s}_1, \ldots, \mathbf{s}_n\}$, and assuming Lambertian reflectance:

\[I_i = \rho \mathbf{n} \cdot \mathbf{s}_i\]

where:

$\rho$ is albedo (surface reflectance)
$\mathbf{n}$ is the surface normal

With $n \geq 3$ lights, we can solve for $\mathbf{n}$ and $\rho$, then integrate normals to obtain depth.
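For a single pixel, solving for $\mathbf{n}$ and $\rho$ is a small least-squares problem: stack the light directions into a matrix, solve for $\mathbf{g} = \rho \mathbf{n}$, then split $\mathbf{g}$ into its length (albedo) and direction (normal). The sketch below assumes no shadows or specular highlights; the light directions and the test surface are made-up values.

```python
import numpy as np

def photometric_stereo_pixel(intensities, light_dirs):
    """Recover albedo and unit normal for one pixel.

    intensities: (n,) observed brightness under each light
    light_dirs:  (n, 3) unit light directions, n >= 3 and not all coplanar
    """
    # Solve light_dirs @ g = intensities in the least-squares sense, g = rho * n.
    g, *_ = np.linalg.lstsq(light_dirs, intensities, rcond=None)
    albedo = np.linalg.norm(g)
    normal = g / albedo
    return albedo, normal

# Synthetic check: render a known surface element, then recover it.
true_normal = np.array([0.0, 0.6, 0.8])
true_albedo = 0.7
lights = np.array([
    [0.0, 0.0, 1.0],
    [0.6, 0.0, 0.8],
    [0.0, 0.6, 0.8],
    [-0.6, 0.0, 0.8],
])
observed = true_albedo * lights @ true_normal   # Lambertian model I_i = rho * n . s_i

albedo, normal = photometric_stereo_pixel(observed, lights)
```

Integrating the recovered normals into a depth map is a separate step that this sketch does not cover.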

The Mathematics of Stereo Vision

Stereo vision is the most widely used passive depth estimation method. Understanding its mathematical foundations is crucial.

Epipolar Geometry

Epipolar geometry describes the geometric relationship between two views of a scene.

Key Concepts:

Epipole ($\mathbf{e}$): Projection of one camera center into the other camera's image plane
Epipolar plane: Plane containing both camera centers and a 3D point
Epipolar line: Intersection of epipolar plane with image plane

The Epipolar Constraint: For a point $\mathbf{x}_L$ in the left image, its correspondence $\mathbf{x}_R$ in the right image must lie on the corresponding epipolar line.

This is encoded by the fundamental matrix $\mathbf{F}$:

\[\mathbf{x}_R^T \mathbf{F} \mathbf{x}_L = 0\]

For calibrated cameras, we use the essential matrix $\mathbf{E}$:

\[\mathbf{x}_R^T \mathbf{E} \mathbf{x}_L = 0\]

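where $\mathbf{E} = \mathbf{K}_R^T \mathbf{F} \mathbf{K}_L$, $\mathbf{K}_L$ and $\mathbf{K}_R$ are the camera intrinsic matrices, and the points in the essential-matrix constraint are expressed in normalized camera coordinates $\mathbf{K}^{-1}\mathbf{x}$ rather than pixels.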
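The constraint can be checked numerically. The sketch below builds $\mathbf{E} = [\mathbf{t}]_\times \mathbf{R}$ for an assumed relative pose (using the convention $\mathbf{X}_R = \mathbf{R}\mathbf{X}_L + \mathbf{t}$), derives $\mathbf{F}$ from it, and verifies both constraints for one 3D point. The pose, intrinsics, and point are all made-up values, and both cameras are assumed to share the same $\mathbf{K}$.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x such that skew(t) @ v == np.cross(t, v)."""
    return np.array([
        [0.0, -t[2], t[1]],
        [t[2], 0.0, -t[0]],
        [-t[1], t[0], 0.0],
    ])

# Assumed relative pose: X_R = R @ X_L + t (small rotation about the y axis).
theta = np.deg2rad(5.0)
R = np.array([
    [np.cos(theta), 0.0, np.sin(theta)],
    [0.0, 1.0, 0.0],
    [-np.sin(theta), 0.0, np.cos(theta)],
])
t = np.array([-0.12, 0.0, 0.0])

# Assumed shared intrinsics (focal length 700 px, principal point at (320, 240)).
K = np.array([[700.0, 0.0, 320.0], [0.0, 700.0, 240.0], [0.0, 0.0, 1.0]])
K_inv = np.linalg.inv(K)

E = skew(t) @ R
F = K_inv.T @ E @ K_inv          # equivalent to E = K^T F K

# Project one 3D point into both cameras.
X_L = np.array([0.3, -0.2, 4.0])
X_R = R @ X_L + t
x_L, x_R = X_L / X_L[2], X_R / X_R[2]   # normalized coordinates
p_L, p_R = K @ x_L, K @ x_R             # homogeneous pixel coordinates

residual_E = x_R @ E @ x_L   # essential-matrix constraint, normalized coordinates
residual_F = p_R @ F @ p_L   # fundamental-matrix constraint, pixel coordinates
```

Both residuals come out at floating-point noise level, which is what the two constraint equations predict for a true correspondence.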

The Disparity-Depth Relationship

Consider a calibrated stereo rig with parallel optical axes (rectified configuration):

Left camera projects 3D point $(X, Y, Z)$ to:

\[u_L = f \frac{X}{Z}, \quad v_L = f \frac{Y}{Z}\]

Right camera (shifted by baseline $b$ along X-axis) projects to:

\[u_R = f \frac{X - b}{Z}, \quad v_R = f \frac{Y}{Z}\]

Disparity is defined as:

\[d = u_L - u_R = f \frac{X}{Z} - f \frac{X - b}{Z} = \frac{fb}{Z}\]

Therefore:

\[Z = \frac{fb}{d}\]

Key Observations:

Depth $Z$ is inversely proportional to disparity $d$
Larger baseline $b$ → larger disparity → better depth resolution
Depth uncertainty grows quadratically with distance: $\sigma_Z \propto Z^2$
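To see where the quadratic growth comes from, differentiate $Z = fb/d$ with respect to disparity and propagate a small matching error $\sigma_d$. This is a first-order approximation, and the numbers in the example are illustrative assumptions rather than values from a specific camera.

\[\frac{\partial Z}{\partial d} = -\frac{fb}{d^2} = -\frac{Z^2}{fb} \quad\Rightarrow\quad \sigma_Z \approx \frac{Z^2}{fb}\,\sigma_d\]

For example, with $f = 700$ px, $b = 0.12$ m (so $fb = 84$), and $\sigma_d = 0.25$ px:

\[\sigma_Z(2\,\text{m}) \approx \frac{2^2}{84} \cdot 0.25 \approx 0.012\,\text{m}, \qquad \sigma_Z(20\,\text{m}) \approx \frac{20^2}{84} \cdot 0.25 \approx 1.19\,\text{m}\]

A tenfold increase in distance produces a hundredfold increase in depth error.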

🎮 Interactive Exploration: Want to build intuition for this relationship? Check out our Interactive Depth vs Disparity Visualization where you can:

Adjust baseline, focal length, and disparity in real time
See 3D stereo camera geometry from multiple viewpoints
Explore the hyperbolic depth-disparity curve
Understand why depth uncertainty grows with distance
Experiment with different stereo camera configurations

Stereo Rectification

Rectification transforms stereo images so that:

Epipolar lines are horizontal and parallel to image rows
Corresponding points have the same $v$ coordinate
Correspondence search reduces to 1D horizontal search

Algorithm:

Compute fundamental matrix $\mathbf{F}$ from point correspondences
Compute rectification homographies $\mathbf{H}_L, \mathbf{H}_R$
Warp both images: $I'_L = \mathbf{H}_L(I_L)$, $I'_R = \mathbf{H}_R(I_R)$

After rectification, for pixel $(u_L, v)$ in left image, the correspondence is at $(u_R, v)$ in right image where $u_R < u_L$.

Correspondence Problem

The correspondence problem is finding which pixel in the right image corresponds to a given pixel in the left image.

Challenges:

Ambiguity: Multiple similar-looking regions
Occlusions: Point visible in one view but not the other
Textureless regions: No distinctive features
Specular reflections: Appearance changes between views
Repetitive patterns: Many plausible matches

Cost Functions measure similarity between patches:

Sum of Absolute Differences (SAD): $\sum_{(i,j) \in W} \lvert I_L(u+i, v+j) - I_R(u-d+i, v+j) \rvert$, where $W$ is the set of offsets in the matching window
Sum of Squared Differences (SSD): $\sum_{(i,j) \in W} (I_L(u+i, v+j) - I_R(u-d+i, v+j))^2$
Normalized Cross-Correlation (NCC): More robust to illumination changes
Census Transform: Encodes local structure, robust to illumination
Mutual Information: Information-theoretic measure
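To make the differences between these costs concrete, the sketch below implements SAD, SSD, NCC, and a basic census cost for a pair of patches, then compares a patch against a brighter, higher-contrast copy of itself. The function names are illustrative, and the census variant shown (compare every pixel with the center, then count differing bits) is one common formulation.

```python
import numpy as np

def sad(a, b):
    """Sum of absolute differences (lower is better)."""
    return np.abs(a - b).sum()

def ssd(a, b):
    """Sum of squared differences (lower is better)."""
    return ((a - b) ** 2).sum()

def ncc(a, b, eps=1e-12):
    """Normalized cross-correlation in [-1, 1] (higher is better)."""
    a0, b0 = a - a.mean(), b - b.mean()
    return (a0 * b0).sum() / (np.sqrt((a0 ** 2).sum() * (b0 ** 2).sum()) + eps)

def census(patch):
    """Bit vector marking which pixels are darker than the center pixel."""
    center = patch[patch.shape[0] // 2, patch.shape[1] // 2]
    return (patch < center).ravel()

def census_cost(a, b):
    """Hamming distance between census bit vectors (lower is better)."""
    return np.count_nonzero(census(a) != census(b))

patch = np.arange(9, dtype=np.float64).reshape(3, 3)
brighter = 2.0 * patch + 10.0   # same structure, different gain and offset

cost_sad = sad(patch, brighter)              # large, penalizes the brightness change
score_ncc = ncc(patch, brighter)             # close to 1, invariant to gain and offset
cost_census = census_cost(patch, brighter)   # 0, the local ordering is unchanged
```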

Stereo Matching Algorithms

Stereo matching algorithms can be categorized by how they optimize the matching cost.

Local Methods

Local methods compute disparity independently for each pixel based on a local window.

Winner-Takes-All (WTA):

\[d^*(u, v) = \arg\min_{d \in [d_{\min}, d_{\max}]} C(u, v, d)\]

where $C(u, v, d)$ is the matching cost for pixel $(u, v)$ at disparity $d$.
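A minimal sketch of this local pipeline, assuming rectified grayscale images and a SAD cost aggregated over a square window: build a cost volume with one slice per disparity, then take the per-pixel argmin. The function names, window radius, and synthetic test pair are assumptions for illustration, and the code favors clarity over speed.

```python
import numpy as np

def box_sum(img, radius):
    """Sum over a (2r+1) x (2r+1) window around each pixel, with edge padding."""
    k = 2 * radius + 1
    padded = np.pad(img, radius, mode="edge")
    out = np.zeros(img.shape, dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            out += padded[dy:dy + img.shape[0], dx:dx + img.shape[1]]
    return out

def sad_cost_volume(left, right, max_disp, radius=2):
    """C(u, v, d): windowed SAD between left(u, v) and right(u - d, v)."""
    h, w = left.shape
    cost = np.full((h, w, max_disp + 1), np.inf)   # inf where u - d is out of bounds
    for d in range(max_disp + 1):
        diff = np.abs(left[:, d:] - right[:, :w - d])
        cost[:, d:, d] = box_sum(diff, radius)
    return cost

def winner_takes_all(cost):
    """Pick the disparity with the lowest cost independently at each pixel."""
    return np.argmin(cost, axis=2)

# Synthetic pair: the left image is the right image shifted by 4 pixels.
rng = np.random.default_rng(0)
right = rng.random((20, 40))
true_disp = 4
left = np.roll(right, true_disp, axis=1)   # left(u) = right(u - 4) away from the wrap-around

disp = winner_takes_all(sad_cost_volume(left, right, max_disp=8))
```

On this richly textured synthetic pair the estimate is exact for every column where the true match is inside the image. Real images are harder because of the texture, occlusion, and illumination issues listed earlier.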

Advantages:

Fast computation (parallelizable)
Simple implementation
Low memory requirements

Disadvantages:

Noisy in low-texture regions
No global consistency
Window size trade-off: small → noisy, large → blurred boundaries

Global Methods

Global methods formulate stereo matching as an energy minimization problem:

\[E(D) = E_{\text{data}}(D) + \lambda E_{\text{smooth}}(D)\]

where:

$E_{\text{data}}(D) = \sum_{(u,v)} C(u, v, D(u, v))$: Data term (matching cost)
$E_{\text{smooth}}(D) = \sum_{(u,v) \sim (u',v')} V(D(u, v), D(u', v'))$: Smoothness term
$\lambda$: Regularization weight

Common Smoothness Terms:

L1: $V(d_p, d_q) = \lvert d_p - d_q \rvert$ (preserves discontinuities)
L2: $V(d_p, d_q) = (d_p - d_q)^2$ (smooth but blurs edges)
Potts model: $V(d_p, d_q) = [d_p \neq d_q]$ (piecewise constant)
Truncated linear: $V(d_p, d_q) = \min(\lvert d_p - d_q \rvert, \tau)$
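The energy of a candidate disparity map can be evaluated directly, which helps show how the choice of $V$ treats a depth edge. The sketch below assumes a 4-connected grid with each neighboring pair counted once; the function names and the tiny example are illustrative. It only evaluates $E(D)$ and does not perform the minimization.

```python
import numpy as np

def truncated_linear(dp, dq, tau):
    """V = min(|dp - dq|, tau): linear for small jumps, capped for large ones."""
    return np.minimum(np.abs(dp - dq), tau)

def potts(dp, dq):
    """V = 1 if the labels differ, 0 otherwise."""
    return (dp != dq).astype(np.float64)

def energy(disp, cost, lam, pairwise):
    """E(D) = sum of C(u, v, D(u, v)) + lam * sum of V over 4-connected neighbor pairs.

    disp: (H, W) integer disparities, cost: (H, W, num_disparities)
    """
    h, w = disp.shape
    rows, cols = np.indices((h, w))
    data = cost[rows, cols, disp].sum()
    smooth = (pairwise(disp[:, 1:], disp[:, :-1]).sum()     # horizontal neighbors
              + pairwise(disp[1:, :], disp[:-1, :]).sum())  # vertical neighbors
    return data + lam * smooth

# Flat data term, so only the smoothness term contributes.
cost = np.zeros((2, 3, 4))
disp = np.array([[0, 0, 3],
                 [0, 0, 3]])   # a depth edge with a jump of 3 in each row

e_trunc = energy(disp, cost, lam=0.5,
                 pairwise=lambda p, q: truncated_linear(p, q, tau=2))
e_potts = energy(disp, cost, lam=0.5, pairwise=potts)
```

With the truncated linear term each jump of 3 is capped at $\tau = 2$, so a sharp edge costs no more than a moderate one, which is why truncation helps preserve discontinuities.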

Optimization Methods:

Graph cuts: Efficiently solves certain energy functions
Belief propagation: Message-passing on graphical models
Dynamic programming: Scanline optimization (ignores vertical coherence)
Variational methods: Continuous optimization with PDEs

Advantages:

Smoother, more coherent depth maps
Better handles occlusions
Enforces geometric constraints

Disadvantages:

Computationally expensive
May over-smooth depth discontinuities

Semi-Global Matching

Semi-Global Matching (SGM) by Hirschmüller (2008) balances accuracy and efficiency.

Key Idea: Approximate global optimization by aggregating costs along multiple 1D paths.

Algorithm:

Compute pixel-wise matching cost $C(u, v, d)$
For each pixel and each path direction $\mathbf{r} = (r_u, r_v)$ (typically 8 or 16 directions): $L_r(u, v, d) = C(u, v, d) + \min\left\{ L_r(u-r_u, v-r_v, d),\; L_r(u-r_u, v-r_v, d \pm 1) + P_1,\; \min_i L_r(u-r_u, v-r_v, i) + P_2 \right\}$
Aggregate costs: $S(u, v, d) = \sum_r L_r(u, v, d)$
WTA: $d^*(u, v) = \arg\min_d S(u, v, d)$

Penalty Terms:

$P_1$: Small penalty for ±1 disparity change
$P_2$: Large penalty for large disparity changes
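The path recursion is easier to follow in code. The sketch below aggregates costs along a single direction (left to right, $\mathbf{r} = (1, 0)$); a full implementation would repeat this for every direction and sum the results. It also subtracts the previous pixel's minimum at each step, a normalization from Hirschmüller's formulation that keeps values bounded without changing the argmin and that is not shown in the formula above. The penalty values and the toy cost volume are assumptions.

```python
import numpy as np

def aggregate_left_to_right(cost, p1, p2):
    """Path cost L_r for direction r = (1, 0). cost has shape (H, W, D)."""
    h, w, d = cost.shape
    agg = np.empty((h, w, d), dtype=np.float64)
    agg[:, 0] = cost[:, 0]                      # the first column has no predecessor
    for u in range(1, w):
        prev = agg[:, u - 1]                    # (H, D) path costs of the previous pixel
        prev_min = prev.min(axis=1, keepdims=True)
        minus = np.full_like(prev, np.inf)
        minus[:, 1:] = prev[:, :-1]             # previous cost at disparity d - 1
        plus = np.full_like(prev, np.inf)
        plus[:, :-1] = prev[:, 1:]              # previous cost at disparity d + 1
        best = np.minimum(prev, np.minimum(minus, plus) + p1)
        best = np.minimum(best, prev_min + p2)  # any larger jump pays P2
        agg[:, u] = cost[:, u] + best - prev_min
    return agg

# Toy example: one row, three pixels, three disparities.
cost = np.array([[[0.0, 5.0, 5.0],     # clearly disparity 0
                  [1.0, 1.0, 0.9],     # nearly ambiguous, raw argmin is 2
                  [5.0, 5.0, 0.0]]])   # clearly disparity 2

agg = aggregate_left_to_right(cost, p1=1.0, p2=3.0)
raw = np.argmin(cost, axis=2)      # per-pixel winner without aggregation
smoothed = np.argmin(agg, axis=2)  # winner after path aggregation
```

The middle pixel's weak preference for disparity 2 is overridden by its confident left neighbor, while the strong evidence at the last pixel still wins despite the $P_2$ penalty.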

Advantages:

Near real-time on modern CPUs
High accuracy (used in autonomous vehicles)
Handles weakly textured regions better than purely local methods

Deep Learning for Depth Estimation

The deep learning revolution has transformed depth estimation, enabling previously impossible capabilities.

Monocular Depth Estimation

Monocular depth estimation predicts depth from a single image—a task impossible with classical geometry alone. Deep networks learn depth cues from data:

Geometric Cues:

Perspective (parallel lines converge)
Occlusion (foreground blocks background)
Relative size (known object sizes)

Photometric Cues:

Atmospheric perspective (distant objects are hazy)
Defocus blur (depth-of-field effects)
Shading and shadows

Early Architectures (2014-2016):

Eigen et al. (2014) pioneered end-to-end CNN depth estimation:

Multi-scale architecture: coarse + fine networks
Loss functions: a scale-invariant log-depth loss (gradient and surface-normal terms were added in the 2015 follow-up by Eigen and Fergus)

Network Structure:

Input RGB → CNN Encoder → Multi-Scale Decoder → Depth Map

Modern Architectures (2017-present):

Encoder-Decoder with Skip Connections:

Encoder: ResNet, EfficientNet, or Swin Transformer
Decoder: Progressive upsampling with skip connections
Similar to semantic segmentation networks (U-Net style)

Example: DPT (Dense Prediction Transformer, 2021):

Input → Vision Transformer Encoder → Reassemble → Convolutional Decoder → Depth

Loss Functions:

Scale-Invariant Loss (Eigen et al.): $\mathcal{L}_{\text{si}} = \frac{1}{n}\sum_i d_i^2 - \frac{\lambda}{n^2}\left(\sum_i d_i\right)^2$ where $d_i = \log \hat{Z}_i - \log Z_i$ and $\lambda = 0.5$
Gradient Loss (preserves depth boundaries): $\mathcal{L}_{\text{grad}} = \sum_i \lvert \nabla_x \hat{D}_i - \nabla_x D_i \rvert + \lvert \nabla_y \hat{D}_i - \nabla_y D_i \rvert$
Multi-Scale Loss (captures both local and global structure): $\mathcal{L}_{\text{ms}} = \sum_{s=1}^{4} \alpha_s \mathcal{L}(\hat{D}^s, D^s)$
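The first two losses are short enough to write out. The NumPy sketch below follows the formulas above and uses forward differences for the gradients, which is an implementation choice. A training pipeline would use an autodiff framework instead; this version only illustrates the arithmetic, and the sample depth map is random.

```python
import numpy as np

def scale_invariant_loss(pred, target, lam=0.5):
    """Mean of d_i^2 minus lam * (mean of d_i)^2, with d_i the log-depth difference."""
    d = np.log(pred) - np.log(target)
    n = d.size
    return (d ** 2).sum() / n - lam * d.sum() ** 2 / n ** 2

def gradient_loss(pred, target):
    """L1 difference between horizontal and vertical depth gradients."""
    diff = pred - target   # the gradient is linear, so differencing the error is equivalent
    gx = np.abs(np.diff(diff, axis=1)).sum()
    gy = np.abs(np.diff(diff, axis=0)).sum()
    return gx + gy

rng = np.random.default_rng(0)
target = rng.uniform(1.0, 10.0, size=(4, 5))   # synthetic ground-truth depth

loss_scaled_full = scale_invariant_loss(2.0 * target, target, lam=1.0)
loss_scaled_half = scale_invariant_loss(2.0 * target, target, lam=0.5)
loss_offset = gradient_loss(target + 1.0, target)
```

With $\lambda = 1$ a prediction that is wrong only by a global scale factor incurs essentially no loss, while $\lambda = 0.5$ penalizes it at half the plain squared-log error, here $0.5 (\log 2)^2$. The gradient loss ignores a constant offset entirely.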

Self-Supervised Learning

Obtaining ground truth depth at scale is expensive. Self-supervised methods leverage geometry and photometric consistency.

Monodepth2 (Godard et al., 2019):

Training Setup:

Input: Stereo image pairs or video sequences
Predict: Depth map $D$ and ego-motion $T$ (for video)
No ground truth depth required!

Photometric Loss:

The key insight: if depth is correct, we can warp one view to another and it should match.

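Given the left image $I_L$ and the depth $D_R$ predicted for the right (target) view, we can synthesize the right image by sampling the left image at the disparity-shifted location. The shift is positive because $d = u_L - u_R$ implies $u_L = u_R + d$:

\[\hat{I}_R(u, v) = I_L\left(u + \frac{bf}{D_R(u, v)}, v\right)\]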

Loss function:

\[\mathcal{L}_{\text{photo}} = \sum_{u,v} \rho\left(I_R(u, v), \hat{I}_R(u, v)\right)\]

where $\rho$ is a robust photometric loss (often SSIM + L1).
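A minimal sketch of the view synthesis and loss for a rectified pair, under several simplifications: the depth map belongs to the view being synthesized, sampling uses linear interpolation along the row, and the photometric term is plain L1 rather than SSIM + L1. The function names and the product $bf = 84$ are assumptions for illustration.

```python
import numpy as np

def warp_horizontal(src, shift):
    """Return out(u, v) = src(u + shift(u, v), v) using linear interpolation.
    Also returns a mask of pixels whose sample location falls inside the image."""
    h, w = src.shape
    u = np.arange(w)[None, :] + shift
    valid = (u >= 0) & (u <= w - 1)
    u = np.clip(u, 0, w - 1)
    u0 = np.floor(u).astype(int)
    u1 = np.minimum(u0 + 1, w - 1)
    a = u - u0                                   # interpolation weight
    rows = np.arange(h)[:, None]
    out = (1 - a) * src[rows, u0] + a * src[rows, u1]
    return out, valid

def photometric_l1(target, synthesized, valid):
    """Mean absolute difference over valid pixels."""
    return np.abs(target - synthesized)[valid].mean()

# Synthetic fronto-parallel scene at 28 m with b * f = 84, so disparity is 3 px.
bf = 84.0
rng = np.random.default_rng(0)
left = rng.random((8, 32))
right = np.roll(left, -3, axis=1)    # right(u) = left(u + 3) away from the wrap-around

good_depth = np.full(left.shape, 28.0)
bad_depth = np.full(left.shape, 42.0)    # would imply a 2 px disparity

synth_good, valid_good = warp_horizontal(left, bf / good_depth)
synth_bad, valid_bad = warp_horizontal(left, bf / bad_depth)
loss_good = photometric_l1(right, synth_good, valid_good)
loss_bad = photometric_l1(right, synth_bad, valid_bad)
```

The correct depth reproduces the target image and drives the loss to zero, while the wrong depth leaves a residual. That difference is the training signal the network receives in place of ground-truth depth.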

Additional Components:

Minimum reprojection loss: Handles occlusions
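Auto-masking: Ignores pixels whose appearance does not change between frames (a static camera, or objects moving at the same speed as the camera)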
Multi-scale prediction: Improves boundary accuracy

Advantages:

No LiDAR or depth sensors needed
Scalable to massive datasets (YouTube videos!)
Learns realistic depth priors

Challenges:

Scale ambiguity when training on monocular video (depth is recovered only up to an unknown scale; stereo training with a known baseline fixes it)
Struggles with non-Lambertian surfaces
Occlusions and dynamic objects cause issues

Multi-View Stereo Networks

Multi-View Stereo (MVS) networks leverage multiple viewpoints with known camera poses.

MVSNet (Yao et al., 2018):

Key Innovation: Build a 3D cost volume in a learned feature space.

Algorithm:

Extract 2D features from each view using CNN: $\{F_1, \ldots, F_N\}$
For each reference view and depth hypothesis $d$:
- Warp features from other views to reference view at depth $d$
- Compute similarity (variance or learned metric)
Build 3D cost volume $C(u, v, d)$
Regularize cost volume with 3D CNN
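Regress depth with soft argmin, taking the softmax over depth hypotheses of the negated cost so that low cost means high probability: $\hat{d}(u, v) = \sum_d d \cdot \text{softmax}(-C(u, v, d))$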
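The soft argmin step can be sketched in a few lines. This version treats the regularized volume as a cost (lower is better) and therefore negates it before the softmax; if a network outputs matching scores instead, the negation would be dropped. The names and the toy volume are assumptions, not MVSNet's actual code.

```python
import numpy as np

def soft_argmin(cost, depth_values):
    """Expected depth under softmax(-cost) taken along the last axis.

    cost:         (H, W, D) regularized cost volume, lower means a better match
    depth_values: (D,) depth hypothesis for each slice
    """
    logits = -cost
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    prob = np.exp(logits)
    prob /= prob.sum(axis=-1, keepdims=True)
    return (prob * depth_values).sum(axis=-1)

depth_values = np.array([1.0, 2.0, 3.0, 4.0])
cost = np.array([[[10.0, 0.0, 10.0, 10.0],    # sharp minimum at the 2.0 hypothesis
                  [5.0, 0.0, 0.0, 5.0]]])     # tie between the 2.0 and 3.0 hypotheses

depth = soft_argmin(cost, depth_values)
```

Unlike a hard argmin, the result is differentiable and can land between hypotheses: the tied pixel regresses to 2.5, giving sub-hypothesis resolution.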

Advantages:

Explicitly encodes multi-view geometry
Handles wide baselines
State-of-the-art reconstruction quality

Applications:

3D scanning
Novel view synthesis (Neural Radiance Fields)
Autonomous driving

Transformer-Based Approaches

Recent work leverages Vision Transformers for depth estimation.

Key Advantages:

Global receptive field (vs. local CNNs)
Better long-range dependencies
State-of-the-art performance

DPT (Dense Prediction Transformer, 2021):

Pre-trained Vision Transformer (ViT) as encoder
Convolutional decoder with skip connections
Reported state-of-the-art results on several benchmarks at the time of publication

AdaBins (2021):

Adaptive binning for depth range
Transformer module for range attention
Handles both indoor and outdoor scenes

MIM (Masked Image Modeling) Pre-training:

Pre-train on masked image reconstruction (like BERT)
Fine-tune for depth estimation
Improves data efficiency

Applications of Depth Maps

Depth maps enable a vast array of applications across computer vision and robotics.

1. Autonomous Driving

Depth Estimation is critical for:

Obstacle detection: Identify pedestrians, vehicles, barriers
Path planning: Compute drivable space
3D object detection: Estimate size and position of objects
Scene understanding: Semantic segmentation + depth

Examples: Tesla Autopilot relies on multi-camera depth estimation; Waymo and Cruise combine cameras with LiDAR and radar

2. Augmented Reality (AR)

Depth Maps enable realistic AR:

Occlusion handling: Virtual objects appear behind real objects
Collision detection: Virtual objects interact with real surfaces
Spatial understanding: Place objects on floors, walls
Hand tracking: Depth improves gesture recognition

Examples: Apple ARKit, Google ARCore, Meta Quest

3. 3D Reconstruction

Structure from Motion (SfM) and SLAM use depth:

3D scanning: Reconstruct objects and environments
Photogrammetry: Create 3D models from images
Cultural heritage: Digitize artifacts and monuments
VFX and gaming: Create realistic 3D assets

4. Robotics

Depth Perception is essential for:

Grasping: Estimate object pose and shape
Navigation: Build occupancy maps
Manipulation: Plan collision-free paths
Human-robot interaction: Maintain safe distances

5. Computational Photography

Depth-Based Effects:

Portrait mode: Realistic bokeh (background blur)
Refocusing: Change focus after capture
3D photos: Parallax effects (Facebook 3D Photos)
Lighting adjustment: Relight scenes based on geometry

6. Medical Imaging

Depth Reconstruction in:

Endoscopy: 3D structure of internal organs
Surgery planning: Anatomical modeling
Prosthetics: Custom-fit devices

Challenges and Future Directions

Despite remarkable progress, depth estimation still faces significant challenges.

Current Limitations

Transparent and Reflective Surfaces
- Glass, mirrors, water violate standard assumptions
- Specular reflections confuse stereo matching
- Recent work: physics-based models, polarization
Textureless Regions
- Large uniform areas (walls, sky) lack features
- CNNs help by learning priors, but still struggle
- Structured light and ToF work better
Scale Ambiguity in Monocular Depth
- Single-image depth is only up to scale
- Need absolute metric depth for robotics
- Solutions: multi-task learning with scale supervision
Generalization Across Domains
- Models trained on outdoor scenes often degrade sharply indoors
- Different camera parameters require retraining
- Domain adaptation and meta-learning help
Real-Time Performance
- High-resolution depth at video rates is demanding
- Mobile deployment requires model compression
- Specialized hardware (neural accelerators) emerging

Emerging Directions

1. Neural Radiance Fields (NeRF) and 3D Gaussian Splatting

Represent scenes as continuous functions
Jointly optimize geometry and appearance
Enable photorealistic novel view synthesis

2. Depth Completion

Fuse sparse LiDAR with dense RGB
Best of both worlds: accurate + dense
Critical for autonomous driving

3. Learned Multi-View Stereo

Replace hand-crafted matching costs with learned features
Adaptive cost aggregation
Hybrid differentiable optimization

4. Event Cameras for Depth

Neuromorphic sensors with microsecond latency
Ideal for high-speed motion
Research frontier: event-based stereo

5. Depth from Defocus

Use lens optics and focus variation
Single-camera solution
Emerging in smartphone photography

6. Foundation Models for 3D

Pre-train on massive 3D datasets
Zero-shot depth estimation
Universal scene understanding (Depth Anything, ZoeDepth)

Key Takeaways

Depth maps encode 3D structure as 2D images where pixel values represent distance from the camera—fundamental for understanding scene geometry.
Classical methods rely on geometric principles:
- Stereo vision triangulates depth from two viewpoints
- Structured light and ToF actively sense distance
- Accuracy depends on calibration and scene properties
Stereo matching is an optimization problem balancing data fidelity and smoothness, with Semi-Global Matching providing an excellent speed-accuracy trade-off.
Deep learning revolutionized depth estimation:
- Monocular networks predict depth from single images by learning 3D priors
- Self-supervised methods train without ground truth depth
- Transformers achieve state-of-the-art by capturing global context
Applications span autonomous driving, AR, robotics, computational photography, and 3D reconstruction—depth is foundational to spatial computing.
Future directions include neural 3D representations (NeRF), foundation models for universal depth estimation, and hybrid approaches combining geometric constraints with learned priors.

Depth estimation bridges classical computer vision geometry with modern deep learning, demonstrating how first principles and data-driven methods can complement each other to solve fundamental perception problems.

