Breast cancer ranks as the second most prevalent cancer in women, recognized as one of the most dangerous types of cancer, and is on the rise globally. Regular screenings are essential for early-stage treatment. Digital mammography (DM) is the most recognized and widely used technique for breast cancer screening. Contrast-Enhanced Spectral Mammography (CESM or CM) is used in conjunction with DM to detect and identify hidden abnormalities, particularly in dense breast tissue where DM alone might not be as effective. In this work, we explore the effectiveness of each modality (CM, DM, or both) in detecting breast cancer lesions using deep learning methods. We introduce an architecture for detecting and classifying breast cancer lesions in DM and CM images in Craniocaudal (CC) and Mediolateral Oblique (MLO) views. The proposed architecture (JointNet) consists of a convolution module for extracting local features, a transformer module for extracting long-range features, and a feature fusion layer to fuse the local features, global features, and global features weighted based on the local ones. This significantly enhances the accuracy of classifying DM and CM images into normal or abnormal categories and lesion classification into benign or malignant. Using our architecture as a backbone, three lesion classification pipelines are introduced that utilize attention mechanisms focused on lesion shape, texture, and overall breast texture, examining the critical features for effective lesion classification. The results demonstrate that our proposed methods outperform their components in classifying images as normal or abnormal and mitigate the limitations of independently using the transformer module or the convolution module. An ensemble model is also introduced to explore the effect of each modality and each view to increase our baseline architecture's accuracy. The results demonstrate superior performance compared with other similar works. The best performance on DM images was achieved with the semi-automatic AOL Lesion Classification Pipeline, yielding an accuracy of 98.85 %, AUROC of 0.9965, F1-score of 98.85 %, precision of 98.85 %, and specificity of 98.85 %. For CM images, the highest results were obtained using the automatic AOL Lesion Classification Pipeline, with an accuracy of 97.47 %, AUROC of 0.9771, F1-score of 97.34 %, precision of 94.45 %, and specificity of 97.23 %. The semi-automatic ensemble AOL Classification Pipeline provided the best overall performance when using both DM and CM images, with an accuracy of 94.74 %, F1-score of 97.67 %, specificity of 93.75 %, and sensitivity of 95.45 %. Furthermore, we explore the comparative effectiveness of CM and DM images in deep learning models, indicating that while CM images offer clearer insights to the human eye, our model trained on DM images yields better results using Attention on Lesion (AOL) techniques. The research also suggests a multimodal approach using both DM and CM images and ensemble learning could provide more robust classification outcomes.