With the advancement of intelligent technology, CNN-based recognition has been integrated into public English teaching in universities, where it can help improve teaching quality and students’ English proficiency. To address the low efficiency of traditional attendance-taking in intelligent classrooms for public English teaching, this study develops a face recognition model based on CNN recognition technology. R-CNN is used for object detection, together with pyramid pooling and non-maximum suppression, to obtain the optimal candidate region for face detection. In addition, K-Means clustering is incorporated to enhance Fast R-CNN and thereby improve detection accuracy. Experimental results show that, among the three networks compared (Fast R-CNN, Faster R-CNN, and CNN), Faster R-CNN maintained a high recognition rate and converged faster, giving the best overall performance. Specifically, at 500 iterations the three networks required 23.7 seconds, 26.8 seconds, and 34.2 seconds, respectively. For facial expression recognition, Faster R-CNN also achieved the highest recognition rate, indicating high detection efficiency and potential for supporting teaching management. This study provides technical support for public English teaching in intelligent university classrooms, with practical significance for improving teaching efficacy, the learning experience, and educational reform.
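As an illustration of the non-maximum suppression step mentioned above, the following is a minimal sketch of the standard greedy procedure, not the implementation used in this study. The box format `(x1, y1, x2, y2)`, the function names `iou` and `non_max_suppression`, and the IoU threshold of 0.5 are all assumptions chosen for the example.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: return indices of kept boxes, highest score first.

    iou_threshold=0.5 is an assumed default, not a value from the study.
    """
    # Visit candidates from the most to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap too much with any
        # higher-scoring box that has already been kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep


# Hypothetical candidate face regions: the first two overlap heavily.
candidates = [(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)]
confidences = [0.9, 0.8, 0.7]
kept = non_max_suppression(candidates, confidences)
```

In this hypothetical input, the second box is suppressed because it overlaps the higher-scoring first box by more than the threshold, so only one candidate region remains per face.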