The main purpose of a drowsiness detection system is to ascertain the state of a driver's eyes, whether they are open and alert or closed, and to warn the driver of their level of fatigue so that an accident can be avoided. It is also advantageous to alert drivers promptly, in real time, before any event that could harm multiple people occurs. With the ongoing advances in Artificial Intelligence (AI) and deep learning (DL) within Advanced Driver Assistance Systems (ADAS), which are significantly transforming the driving experience, the adoption of Internet-of-Things (IoT) technology for driver action recognition has become imperative. This work presents a deep learning model that uses a CNN–Long Short-Term Memory (CNN-LSTM) network to detect driver drowsiness. We evaluate several algorithms for classification, namely EM-CNN, VGG-16, GoogLeNet, AlexNet, ResNet50, and CNN-LSTM, and find that CNN-LSTM achieves higher accuracy than the alternative deep learning algorithms. The model receives video clips of fixed duration and classifies each clip by analyzing the sequence of motions exhibited by the driver. The key objective of this work is to promote road safety by notifying drivers when they exhibit signs of drowsiness, thereby minimizing the probability of fatigue-related accidents, and to support the development of an ADAS that detects and addresses driver fatigue proactively. The proposed approach, which combines facial movement analysis using CNN-LSTM with a U-Net-based architecture, aims to achieve high efficacy while remaining non-intrusive, so that it can be seamlessly integrated into current automobiles and made accessible to a broader spectrum of drivers.
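
To illustrate the general CNN-LSTM structure described above, the following minimal sketch builds a clip-level classifier in Python with Keras. The clip length, frame resolution, layer sizes, and two-class output (alert vs. drowsy) are assumptions chosen for brevity, not the exact configuration used in this work; a per-frame CNN is applied to every frame via TimeDistributed, and an LSTM aggregates the resulting feature sequence.

```python
# Minimal CNN-LSTM sketch for clip-level drowsiness classification.
# Hypothetical configuration: 20-frame clips of 64x64 RGB frames and
# two output classes (alert vs. drowsy); not the architecture or
# hyperparameters reported in this work.
from tensorflow.keras import layers, models

NUM_FRAMES, HEIGHT, WIDTH, CHANNELS = 20, 64, 64, 3

# Per-frame CNN feature extractor, applied independently to each frame.
frame_cnn = models.Sequential([
    layers.Input(shape=(HEIGHT, WIDTH, CHANNELS)),
    layers.Conv2D(16, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation="relu", padding="same"),
    layers.MaxPooling2D(),
    layers.GlobalAveragePooling2D(),
])

# LSTM aggregates the sequence of per-frame features over the whole clip.
model = models.Sequential([
    layers.Input(shape=(NUM_FRAMES, HEIGHT, WIDTH, CHANNELS)),
    layers.TimeDistributed(frame_cnn),
    layers.LSTM(64),
    layers.Dense(2, activation="softmax"),  # alert vs. drowsy
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

In practice, the temporal aggregation step is what lets the model use the sequence of driver motions (e.g., prolonged eye closure across consecutive frames) rather than a single static image when deciding whether a clip indicates drowsiness.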