Abstract

We propose a multi-task learning framework for improving the performance of vision-based deep-learning approaches to driver distraction recognition. The most popular tools for this task so far are convolutional neural networks (CNNs), which have proven to be strongly biased toward local features. This bias causes CNNs to neglect global structural information, which harms robustness in distracted-driver recognition. To address this problem, we generate a positive and a negative sample for each input and construct a triplet of images (i.e., raw image, positive sample, and negative sample). The positive sample is generated by applying structure-aware illumination to the human-body region of the input. The negative sample is generated by randomly shuffling the local regions of the input. The networks are then trained on these triplets with a multi-task learning strategy that forces them to explore global information through multiple tasks: (a) recognizing the raw input and the positive sample as the given ground truth; (b) recognizing the negative sample as an extra "meaningless" label; and (c) pulling the features of the raw input and the positive sample closer together while pushing the features of the raw input and the negative sample apart. In this way, the model learns to neglect background information and to attend to the global structural information of the scene. The proposed approach reaches state-of-the-art performance on the AUC Distracted Driver Dataset and outperforms state-of-the-art studies on the Drive and Act Dataset. With raw images as input, we achieve an accuracy of 96.0% on the AUC Distracted Driver Dataset and 66.8% on the Drive and Act Dataset. Our approach introduces no extra overhead during the testing (i.e., inference) procedure, which is helpful for real-life applications. Moreover, better accuracy can be achieved by fusing the predictions obtained from the raw input and the positive sample, yielding 96.3% on the AUC Distracted Driver Dataset and 66.9% on the Drive and Act Dataset. The class activation maps (CAMs) of the proposed method are subjectively more reasonable, which enhances the reliability and explainability of the model.
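The abstract describes the triplet construction and the three training objectives but gives no implementation details. The sketch below is a minimal PyTorch illustration of that recipe under our own assumptions: the functions `shuffle_patches` and `illuminate_body`, the grid size, the brightening gain, the triplet margin, and the loss weights are hypothetical choices, not taken from the paper, and the "structure-aware illumination" is approximated here by simply brightening a given body mask.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def shuffle_patches(img, grid=4):
    """Negative sample: randomly shuffle grid x grid local patches of a (C,H,W)
    image, destroying global structure while keeping local appearance.
    Assumes H and W are divisible by `grid`."""
    c, h, w = img.shape
    ph, pw = h // grid, w // grid
    patches = [img[:, i * ph:(i + 1) * ph, j * pw:(j + 1) * pw]
               for i in range(grid) for j in range(grid)]
    perm = torch.randperm(len(patches))
    rows = [torch.cat([patches[perm[i * grid + j]] for j in range(grid)], dim=2)
            for i in range(grid)]
    return torch.cat(rows, dim=1)

def illuminate_body(img, body_mask, gain=1.3):
    """Positive sample (assumed form): brighten the human-body region of an
    image in [0,1], given a body mask from an off-the-shelf segmenter."""
    return torch.clamp(img * (1.0 + (gain - 1.0) * body_mask), 0.0, 1.0)

class TripletMultiTaskLoss(nn.Module):
    """(a)+(b): cross-entropy on the raw and positive samples with the ground
    truth and on the negative sample with an extra 'meaningless' class;
    (c): a triplet margin loss that pulls raw/positive features together and
    pushes raw/negative features apart."""
    def __init__(self, margin=1.0, w_cls=1.0, w_trip=1.0):
        super().__init__()
        self.triplet = nn.TripletMarginLoss(margin=margin)
        self.w_cls, self.w_trip = w_cls, w_trip

    def forward(self, logits_raw, logits_pos, logits_neg,
                feat_raw, feat_pos, feat_neg, target, meaningless_idx):
        ce = (F.cross_entropy(logits_raw, target)
              + F.cross_entropy(logits_pos, target)
              + F.cross_entropy(logits_neg,
                                torch.full_like(target, meaningless_idx)))
        trip = self.triplet(feat_raw, feat_pos, feat_neg)
        return self.w_cls * ce + self.w_trip * trip
```

At test time, the reported best accuracies come from fusing the predictions on the raw image and its positive counterpart; a simple average of the two softmax outputs would be one way to implement that fusion, though the exact fusion rule is not specified in the abstract.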
