This work investigates several neural network architectures for classifying hand X-ray images as "Fractured" or "Not Fractured": Fully Connected Networks (FCN), Convolutional Neural Networks (CNN), ResNet (with and without pretraining), and the Vision Transformer (ViT-B-16). The main goals are to evaluate each model's fracture detection ability and to determine which architectural design best suits the task. The pretrained ResNet emerged as the most effective model, combining high accuracy, stability, and robustness, because transfer learning lets it reuse prior knowledge from large-scale image datasets. The custom CNN also performed well, showing strong feature extraction on medical images. The non-pretrained ResNet, by contrast, overfitted, suggesting that deeper networks struggle to generalize without pretraining. The Vision Transformer, though innovative, performed poorly because it depends on large amounts of training data and struggles to learn intricate spatial features from small datasets. The FCN served as a baseline, but its simple architecture and inability to capture spatial hierarchies in images meant it could not match the CNN models. The results show that pretrained CNN architectures, especially ResNet, offer the most consistent and accurate approach to automatic fracture detection in medical images, and they underline the importance of transfer learning in clinical applications.