Digital Breast Tomosynthesis (DBT) has transformed breast imaging: its three-dimensional (3D) visualization enhances lesion conspicuity, reduces tissue overlap, and improves diagnostic accuracy relative to conventional two-dimensional (2D) mammography. In this study, we propose a Computer-Aided Detection (CAD) system that leverages vision transformers to improve DBT's diagnostic efficiency. The system uses a neural network to extract features from individual 2D DBT slices, followed by a post-processing step that aggregates features from neighboring slices to classify the entire 3D scan. Using transfer learning, we trained and validated our CAD framework on a dataset of 3,831 DBT scans and tested it on a held-out set of 685 scans. Among the architectures evaluated, the Swin Transformer outperformed both ResNet101 and the vanilla Vision Transformer, achieving an AUC of 0.934 ± 0.026 at a resolution of 384 × 384. Increasing the input resolution from 224 × 224 to 384 × 384 preserved fine image detail and produced a statistically significant improvement in performance (p = 0.0003). The Mean Teacher algorithm, a semi-supervised method trained on both labeled and unlabeled DBT slices, showed no significant improvement over the fully supervised approach. Analyses across lesion types, lesion sizes, and patient ages showed consistent performance. Visualizing the model's attention maps provided a view of the model's decision-making process, highlighting the regions prioritized during assessment. These findings advance the methodology of DBT image analysis and set a strong benchmark for breast cancer diagnostic accuracy.
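To make the slice-level scoring and neighbor-slice aggregation concrete, the sketch below shows one plausible PyTorch implementation. The `timm` Swin backbone, the moving-average window, and the max over smoothed slice scores are illustrative assumptions; the abstract does not specify the exact post-processing used in the study.

```python
# Minimal sketch of per-slice scoring followed by neighbor-slice
# aggregation, as described in the abstract. Backbone choice and the
# moving-average + max aggregation are assumptions, not the paper's
# confirmed method.
import torch
import torch.nn.functional as F
import timm

# Swin Transformer at 384x384 input, adapted to a single-logit output.
model = timm.create_model(
    "swin_base_patch4_window12_384", pretrained=True, num_classes=1
)
model.eval()

@torch.no_grad()
def score_volume(slices: torch.Tensor, window: int = 3) -> float:
    """slices: (num_slices, 3, 384, 384) tensor of preprocessed DBT slices.

    Returns a volume-level probability by smoothing per-slice scores
    over neighboring slices and taking the maximum.
    """
    logits = model(slices).squeeze(-1)   # one logit per 2D slice
    probs = torch.sigmoid(logits)        # per-slice probabilities
    # Moving average over adjacent slices (assumed stand-in for the
    # paper's neighbor-slice post-processing step).
    kernel = torch.ones(1, 1, window) / window
    smoothed = F.conv1d(probs.view(1, 1, -1), kernel, padding=window // 2)
    return smoothed.max().item()         # volume-level score

# Example: a hypothetical 40-slice DBT volume.
# volume = torch.randn(40, 3, 384, 384)
# print(score_volume(volume))
```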