Abstract

The scope of the conformation space that protein molecules can adopt is a problem of significant interest. Previous studies by other groups have shown that stereochemical constraints confine local protein structures to a limited range of conformations. Furthermore, the results of many groups have demonstrated that the sequence‐to‐structure relationship remains detectable to some extent at the local level. By studying the conformational space of local protein structures, we may obtain more information about the constraints on local structural space and about the sequence‐to‐structure mapping, and hence facilitate ab initio structure prediction. In this study, we propose a novel algorithm that automatically discovers recurrent pentamer structures in proteins.

The algorithm starts by applying Expectation‐Maximization (EM) clustering to the distances between non‐adjacent backbone Cα atoms in a large set of pentamer fragments, which yields a rough partition of the conformation space. In the second stage, a split‐and‐merge algorithm reduces this partition to a finite number of clusters and ensures that each cluster is homogeneous and distinct from the others. Each cluster of protein structures is represented by a centroid structure. The results show that, with 40 major representative structures, we can approximate most of the protein fragments with an error of 0.378 Å. With only 20 types of structures, the fragment structures can still be modeled at 0.44 Å, which is comparable to or better than the performance of previous methods. We term the representatives “building blocks.” At the global level, we demonstrate that by concatenating different combinations of building blocks, we can model whole protein structures at high resolution: a resolution of 2.54 Å can be achieved using only 10 types of building blocks. This finding suggests that the study of molecular structures can be greatly simplified using this reduced representation.
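
The feature representation named in the abstract, distances between non‐adjacent backbone Cα atoms of a pentamer, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, the EM clustering and split‐and‐merge stages are omitted, and the centroids are assumed to be given already as vectors in the same distance space. Assigning a fragment to its nearest centroid by Euclidean distance is likewise an assumption made for illustration; the abstract does not state how fragments are matched to building blocks.

```python
import math
from itertools import combinations

# Index pairs of Cα atoms within a pentamer that are NOT sequence neighbours
# (separation of at least two positions). A pentamer has 6 such pairs.
NON_ADJACENT_PAIRS = [(i, j) for i, j in combinations(range(5), 2) if j - i >= 2]


def pentamer_features(ca_coords):
    """Return the 6 distances between non-adjacent Cα atoms of one pentamer."""
    if len(ca_coords) != 5:
        raise ValueError("a pentamer needs exactly 5 Cα coordinates")
    return [math.dist(ca_coords[i], ca_coords[j]) for i, j in NON_ADJACENT_PAIRS]


def sliding_pentamers(chain):
    """Yield every overlapping window of 5 consecutive Cα coordinates."""
    for start in range(len(chain) - 4):
        yield chain[start:start + 5]


def nearest_block(features, centroids):
    """Index of the centroid closest to `features` (Euclidean, an assumed metric)."""
    return min(range(len(centroids)), key=lambda k: math.dist(features, centroids[k]))


# Toy input: a fully extended chain with an assumed 3.8 Å Cα-Cα spacing.
chain = [(3.8 * i, 0.0, 0.0) for i in range(7)]
features = [pentamer_features(p) for p in sliding_pentamers(chain)]

# Two made-up centroids in distance space, for illustration only.
toy_centroids = [
    [5.5, 5.2, 6.2, 5.5, 5.2, 5.5],      # compact fragment (hypothetical values)
    [7.6, 11.4, 15.2, 7.6, 11.4, 7.6],   # fully extended fragment
]
labels = [nearest_block(f, toy_centroids) for f in features]
```

Describing each fragment by internal distances makes the representation independent of the fragment's position and orientation, so no structural superposition is needed before clustering. In a fuller pipeline, the `features` vectors would be the input to a mixture‐model EM step rather than to fixed centroids.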