This paper presents a single-chip, high-performance, and energy-efficient stereo vision depth-estimation processor for micro aerial vehicles (MAVs). The proposed processor implements the state-of-the-art semi-global matching (SGM) algorithm to deliver full high-definition (HD, 1920 ${\times }$ 1080) stereo-depth outputs with a maximum of 38 frames/s throughput. Algorithm-architecture co-optimization is conducted, introducing overlapping block-based processing that eliminates very large on-chip memory and off-chip DRAM. We exploit inherent data parallelism in the algorithm by processing 128 local disparity costs and aggregating the SGM costs along four paths for all 128 disparities in parallel. A dependence-resolving scan associated with 16-stage deep pipeline is introduced to hide the data dependence between neighboring pixels in the SGM algorithm. Moreover, we propose a customized ultra-high bandwidth dual-port SRAM that utilizes the unique memory access characteristic of SGM to achieve highly energy-efficient memory access at a very high on-chip memory bandwidth of 1.64 Tb/s. The fabricated processor produces 512 levels of depth information for each pixel at full HD resolution with 30-frames/s performance, consuming 836 mW from a 0.75-V supply in TSMC 40-nm GP CMOS. We ported the design on a quadcopter MAV to demonstrate its performance in realistic real-time flight.