Abstract
Coral reefs are among the most diverse ecosystems on our planet and are essential to the livelihoods of hundreds of millions of people who depend on them for food security, tourism income, and coastal protection. Unfortunately, most coral reefs are existentially threatened by global climate change and local anthropogenic pressures. To better understand the dynamics underlying reef deterioration, monitoring at high spatial and temporal resolution is key. However, conventional methods for quantifying coral cover and species abundance are limited in scale by the extensive manual labor they require. Computer vision tools have been employed to aid in this process, in particular structure-from-motion (SfM) photogrammetry for 3D mapping and deep neural networks for image segmentation, but analysis of the resulting data products remains a bottleneck that effectively limits their scalability. This paper presents a new paradigm for mapping underwater environments from ego-motion video, unifying 3D mapping systems that use machine learning to adapt to challenging underwater conditions with a modern approach to semantic image segmentation. The method is exemplified on coral reefs in the northern Gulf of Aqaba, Red Sea, demonstrating high-precision 3D semantic mapping at unprecedented scale with significantly reduced labor costs: given a trained model, a 100 m video transect acquired within 5 min of diving with a low-cost consumer-grade camera can be transformed fully automatically into a semantic point cloud within 5 min. We demonstrate the spatial accuracy of our method and its semantic segmentation performance (at least 80% total accuracy), and we publish a large dataset of ego-motion videos from the northern Gulf of Aqaba, along with a dataset of video frames annotated for dense semantic segmentation of benthic classes.
Our approach significantly scales up coral reef monitoring by taking a leap toward fully automatic analysis of video transects. It advances coral reef transect surveys by reducing labor, equipment, logistics, and computing costs, which can help inform conservation policies more efficiently. The underlying computational method, learning-based structure-from-motion, has broad implications for fast, low-cost mapping of underwater environments beyond coral reefs.
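The abstract does not spell out how per-frame segmentations become a semantic point cloud. One common way to do this, shown here only as an assumption and not as the authors' method, is to project every reconstructed 3D point into each frame using that frame's estimated pinhole camera pose and then take a majority vote over the predicted class labels at the projected pixels. The function names `project` and `fuse_labels`, the world-to-camera pose convention, and the nested-list label images are all hypothetical choices made for this sketch.

```python
from collections import Counter


def project(point, R, t, fx, fy, cx, cy):
    """Project a world point to pixel coordinates with a pinhole camera.

    Assumed convention (hypothetical): X_cam = R @ X_world + t, with R a 3x3
    nested list and t a length-3 list. Returns (u, v), or None if the point
    lies behind the camera.
    """
    xc = [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        return None
    u = fx * xc[0] / xc[2] + cx
    v = fy * xc[1] / xc[2] + cy
    return u, v


def fuse_labels(points, frames, fx, fy, cx, cy):
    """Give each 3D point the majority class among the frames that see it.

    frames: list of (R, t, label_image), where label_image[row][col] holds the
    class predicted by a 2D segmentation model for that pixel.
    Returns one label per point, or None if no frame observes the point.
    Occlusion is ignored here; a real pipeline would also compare depths.
    """
    votes = [Counter() for _ in points]
    for R, t, labels in frames:
        h, w = len(labels), len(labels[0])
        for idx, p in enumerate(points):
            uv = project(p, R, t, fx, fy, cx, cy)
            if uv is None:
                continue
            col, row = int(round(uv[0])), int(round(uv[1]))
            if 0 <= row < h and 0 <= col < w:
                votes[idx][labels[row][col]] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


# Usage: three views from the same pose; two predict "coral" at the centre pixel.
I3 = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
t0 = [0, 0, 0]
coral_view = [["sand"] * 3, ["sand", "coral", "sand"], ["sand"] * 3]
sand_view = [["sand"] * 3 for _ in range(3)]
frames = [(I3, t0, coral_view), (I3, t0, coral_view), (I3, t0, sand_view)]
points = [(0, 0, 1), (0, 0, -1), (5, 0, 1), (1, 1, 1)]
print(fuse_labels(points, frames, fx=1, fy=1, cx=1, cy=1))
```

Voting across many frames makes the point labels tolerant of occasional misclassifications in individual frames, which is one reason multi-view fusion is attractive for video transects with heavy overlap between consecutive frames.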