Structure from Motion (SfM) is key to many computer vision and photogrammetry applications. However, the fast-growing demand for large-scale SfM challenges current SfM solutions. Unlike traditional global and incremental SfM solutions, hierarchical SfM approaches show promising potential for reconstructing large-scale image sets: they divide the image set into multiple image clusters, reconstruct each cluster separately, and gradually merge the partial models into a complete model. However, current hierarchical SfM approaches still struggle with the following problems: accurate image clustering without ancillary information; automatic quality evaluation of each reconstruction unit and removal of unreliable partial reconstructions; effective and efficient reconstruction of each image cluster; and robust and accurate cluster merging that accounts for the merging order and handles images taken with the same camera but assigned to different clusters. These unstable factors limit the robustness and accuracy of hierarchical SfM approaches on different unstructured image sets.

To systematically improve the performance of hierarchical SfM, we propose a novel robust hierarchical structure from motion (RHSfM) method for large-scale image sets, which does not rely on any additional information, such as Global Positioning System (GPS) or Inertial Navigation System (INS) data.
(1) We develop an automatic image clustering method based on image correlation and present a dynamic adjustment strategy, obtaining reliable image clustering results.
(2) We remove poor reconstructions by introducing multiple quality evaluation standards.
(3) We put forward a fast incremental SfM algorithm that optimizes the image adding mode with an image pre-screening strategy and removes the dependence via the proposed dynamic adjustment strategy.
(4) We achieve accurate cluster merging by creating an optimal merging list and employing a stepwise global optimization strategy that merges structures first and then cameras.
Notably, the entire process is fully automated with only a few input parameters, and the final result is not sensitive to these parameters.

We verify our method on various real image sets that cover different image conditions, different scenes, and different image scales, including two large-scale image sets with 121,506 and 153,396 images, respectively. The experimental results reveal that our approach outperforms the state-of-the-art SfM systems Colmap, 3DF Samantha, and Metashape in terms of robustness, accuracy, and efficiency. In particular, only our method successfully reconstructed all seven challenging datasets. For the five datasets that the other systems can also reconstruct, our method obtains the highest accuracy, which is on average 25 percent better than the best result among the compared methods; for the remaining two datasets, the accuracy of our method is better than 0.75 pixels. Moreover, on the experimental image sets, our method is on average about 18, 4.85, and 0.25 times faster than Colmap, 3DF Samantha, and Metashape, respectively. Overall, our contribution provides a comprehensive and practical solution for large-scale SfM.
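
To make the divide-and-merge idea concrete, the following is a minimal sketch of two of the steps named above: grouping images into clusters and deriving an order in which to merge the partial models. It is not the RHSfM algorithm. It assumes that image correlation can be approximated by pairwise feature match counts, and it replaces the proposed clustering, dynamic adjustment, and optimal merging list with plain thresholded connected components and a greedy strongest-link-first ordering. The names `cluster_images`, `merging_order`, `match_counts`, and `min_matches` are hypothetical.

```python
from collections import defaultdict


def cluster_images(num_images, match_counts, min_matches):
    """Cluster images as connected components of the match graph.

    Assumption: two images are "correlated" when they share at least
    `min_matches` feature matches. `match_counts` maps (i, j) -> count.
    """
    parent = list(range(num_images))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (a, b), count in match_counts.items():
        if count >= min_matches:
            parent[find(a)] = find(b)

    clusters = defaultdict(list)
    for i in range(num_images):
        clusters[find(i)].append(i)
    return sorted(clusters.values())


def merging_order(clusters, match_counts):
    """Greedy merging list: always merge the two partial models that
    share the largest total number of cross-cluster matches."""
    owner = {img: idx for idx, imgs in enumerate(clusters) for img in imgs}

    # Total match count between every pair of distinct clusters.
    strength = defaultdict(int)
    for (a, b), count in match_counts.items():
        ca, cb = owner[a], owner[b]
        if ca != cb:
            strength[frozenset((ca, cb))] += count

    # Each group is a partial model made of one or more original clusters.
    groups = {idx: {idx} for idx in range(len(clusters))}
    order = []
    while len(groups) > 1:
        best = None
        for ga in groups:
            for gb in groups:
                if ga < gb:
                    s = sum(strength.get(frozenset((x, y)), 0)
                            for x in groups[ga] for y in groups[gb])
                    if best is None or s > best[0]:
                        best = (s, ga, gb)
        s, ga, gb = best
        if s == 0:
            break  # the remaining partial models share no matches
        order.append((sorted(groups[ga]), sorted(groups[gb])))
        groups[ga] |= groups.pop(gb)
    return order


# Toy example: six images with hypothetical pairwise match counts.
match_counts = {(0, 1): 120, (1, 2): 90, (2, 3): 15,
                (3, 4): 200, (4, 5): 10, (0, 5): 5}
clusters = cluster_images(6, match_counts, min_matches=50)
order = merging_order(clusters, match_counts)
print(clusters)  # [[0, 1, 2], [3, 4], [5]]
print(order)     # [([0], [1]), ([0, 1], [2])]
```

In the toy example, weakly connected image pairs are cut during clustering, and the same weak links then decide which partial models are merged first. A real pipeline would additionally reconstruct each cluster, evaluate its quality, and run a global optimization after each merge, none of which is modeled here.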