Virtual reality (VR) can potentially enhance student engagement and memory retention in the classroom. However, distraction among participants in a VR-based classroom is a significant concern. Several factors, such as mind wandering, external noise, and stress, can cause students to become internally and/or externally distracted while learning. Distraction can be detected from single-modal or multi-modal features, but a single modality has been found insufficient for detecting both internal and external distractions, mainly because of individual variability. In this work, we investigated multi-modal features, namely eye-tracking and EEG data, to classify internal and external distractions in an educational VR environment. We set up an educational VR environment and equipped it for multi-modal data collection. We implemented several machine learning (ML) methods, including k-nearest neighbors (kNN), Random Forest (RF), a one-dimensional convolutional neural network with long short-term memory (1D-CNN-LSTM), and a two-dimensional convolutional neural network (2D-CNN), to classify participants' internal and external distraction states from the multi-modal features. We evaluated our models with cross-subject, cross-session, and gender-based grouping tests. Compared to the other models, the RF classifier achieved the highest accuracy: over 83% in the cross-subject test, around 68% to 78% in the cross-session test, and around 90% in the gender-based grouping test. SHAP analysis of the extracted features indicated greater contributions from the occipital and prefrontal regions of the brain, as well as from the gaze angle, gaze origin, and head rotation features of the eye-tracking data.
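As an illustration of what a cross-subject test involves, the sketch below holds out every sample from one participant per fold, so the classifier is always scored on a person it has not seen. It is a minimal example under stated assumptions, not the evaluation pipeline used in this work: the abstract does not specify the splitting protocol, so leave-one-subject-out is assumed here; the two-dimensional features are synthetic; a small pure-Python kNN stands in for the classifiers named above; and all function names (`leave_one_subject_out`, `knn_predict`, `make_toy_data`, `cross_subject_accuracy`) are hypothetical.

```python
import math
import random
from collections import Counter


def leave_one_subject_out(subject_ids):
    """Yield (held_out, train_idx, test_idx), holding out one subject per fold."""
    for held_out in sorted(set(subject_ids)):
        train_idx = [i for i, s in enumerate(subject_ids) if s != held_out]
        test_idx = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train_idx, test_idx


def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k training samples nearest to x (Euclidean)."""
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    return Counter(train_y[i] for i in order[:k]).most_common(1)[0][0]


def make_toy_data(n_subjects=4, per_class=10, seed=0):
    """Synthetic stand-in for extracted features (assumption, not real data)."""
    rng = random.Random(seed)
    X, y, subjects = [], [], []
    for subject in range(n_subjects):
        # Per-subject shift, loosely mimicking individual variability.
        offset = rng.gauss(0.0, 0.5)
        for label, center in (("internal", 0.0), ("external", 3.0)):
            for _ in range(per_class):
                X.append([center + offset + rng.gauss(0.0, 1.0),
                          center + offset + rng.gauss(0.0, 1.0)])
                y.append(label)
                subjects.append(subject)
    return X, y, subjects


def cross_subject_accuracy(X, y, subjects, k=3):
    """Return a dict mapping each held-out subject to its test accuracy."""
    scores = {}
    for held_out, train_idx, test_idx in leave_one_subject_out(subjects):
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        correct = sum(
            knn_predict(train_X, train_y, X[i], k) == y[i] for i in test_idx
        )
        scores[held_out] = correct / len(test_idx)
    return scores


X, y, subjects = make_toy_data()
scores = cross_subject_accuracy(X, y, subjects)
for subject, acc in scores.items():
    print(f"held-out subject {subject}: accuracy {acc:.2f}")
```

Splitting by participant rather than by sample keeps one person's data from appearing in both the training and test sets, which would otherwise inflate the reported accuracy. A cross-session test could plausibly reuse the same splitter with session identifiers in place of subject identifiers.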