As deep learning (DL) gains popularity for its ability to make accurate predictions in various fields, its applications in the geosciences are also on the rise. Many studies focus on achieving high accuracy by selecting models, developing more complex architectures, and tuning hyperparameters. However, the interpretability of these models, that is, the ability to understand how they make their predictions, is less frequently discussed. To address the challenge of high accuracy but low interpretability of DL models in the geosciences, we study rock classification from thin-section photomicrographs of six types of sedimentary rocks: quartz arenite, feldspathic arenite, lithic arenite, siltstone, oolitic packstone, and dolomite. These rocks are distinguished by their characteristic framework grains and grain textures, such as the rounded or oval ooids in oolitic packstone. We first train standard DL models, such as ResNet-50, on these photomicrographs and achieve an accuracy of over 0.94. However, these models base their classifications on features such as cracks, cements, and scale bars, which are irrelevant for distinguishing sedimentary rocks in real-world applications. To address this issue, we then propose an attention-based dual network that incorporates both global (overall photomicrograph) and local (distinguishing framework grains) features. Our proposed model not only achieves high accuracy (0.99) but also provides interpretable feature extractions. Our study highlights the need to consider interpretability and geological knowledge, in addition to high accuracy, when developing DL models.