Pronunciation feedback is essential when teaching languages to children, motivating the development of computer-assisted pronunciation training (CAPT) systems that automate this process. Most existing CAPT systems for the Arabic language were evaluated on a wide age range that included children but was dominated by adults. These systems therefore did not address the challenges posed by children's speech, whose acoustic characteristics differ from those of adults. Moreover, owing to the lack of publicly available Arabic datasets, Arabic CAPT systems have been evaluated mainly on datasets consisting of the sounds of Arabic letters, with only minor efforts covering words or sentences. In this paper, we propose the Arabic Utterance Mispronunciation Detector for Children (AUMD-Child), a system that detects children's mispronunciations of Arabic words in real time. The proposed system is based on the fusion of a Vision Transformer (ViT) and a transfer-learning-based model (AlexNet). To evaluate the system, a novel public dataset is constructed, consisting of 16 words collected from children aged 7 to 12 years. Compared with state-of-the-art models in the literature, the experimental results show that the accuracy of the proposed AUMD-Child system exceeded handcrafted-feature-based models by an average of 7 %, CNN-feature-based models using AlexNet and SVM by 5 %, a transfer-learning-based model using AlexNet by 4 %, and a transformer-based model using ViT by 2 %, achieving an average accuracy of 91.81 % for detecting mispronunciation patterns in Arabic words, while its average processing time of 33 ms outperformed the same models by 13 %, 0.03 %, 0.06 %, and 23 %, respectively.