Abstract

Traffic detection (including lane detection and traffic sign detection) is one of the key technologies for realizing driver-assistance and autonomous-driving systems. However, most existing detection methods are designed for single-modal visible-light data; when the scene lighting changes dramatically (for example, insufficient illumination at night), it is difficult for these methods to obtain good detection results. Since multi-modal data can provide complementary discriminative information, this paper proposes a multi-modal fusion YOLOv5 network, built on the YOLOv5 model, which consists of three key components: a dual-stream feature extraction module, a correlation feature extraction module, and a self-attention fusion module. Specifically, the dual-stream feature extraction module extracts features from each of the two modalities. The features learned by the dual-stream module are then fed into the correlation feature extraction module to learn maximally correlated features. Next, the extracted maximally correlated features are used to exchange information between the modalities through a self-attention mechanism, yielding fused features. Finally, the fused features are fed into the detection layer to obtain the final detection result. Experimental results on different traffic detection tasks demonstrate the superiority of the proposed method.
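The self-attention fusion step described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes two streams of per-location feature tokens (e.g. visible-light and infrared), stacks them, and applies scaled dot-product attention so each token can mix information from both modalities. The identity Q/K/V projections stand in for learned weight matrices, which are an assumption here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_fusion(feat_vis, feat_ir):
    """Fuse two modality token matrices of shape (N, d) via self-attention.

    Tokens from both modalities are concatenated, so the attention map
    covers intra- and cross-modal pairs; each output token is a weighted
    mix of features from both streams.
    """
    tokens = np.concatenate([feat_vis, feat_ir], axis=0)  # (2N, d)
    d = tokens.shape[1]
    # Identity projections in place of learned Q/K/V weights (assumption).
    q, k, v = tokens, tokens, tokens
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)  # (2N, 2N) weights
    fused = attn @ v                               # (2N, d) fused features
    return fused, attn

rng = np.random.default_rng(0)
vis = rng.standard_normal((4, 8))   # 4 visible-light tokens, 8-dim
ir = rng.standard_normal((4, 8))    # 4 infrared tokens, 8-dim
fused, attn = self_attention_fusion(vis, ir)
print(fused.shape)  # (8, 8): 8 fused tokens of dimension 8
```

In the actual network the fused tokens would be reshaped back into feature maps and passed to the YOLOv5 detection head; the toy dimensions above are purely illustrative.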
