Abstract

With the progress of technology, modern industrial monitoring data include not only traditional process data but also video data. To make full use of these multi-source heterogeneous data and achieve higher fault classification accuracy, feature extraction and feature fusion across different modalities are vitally important. For feature extraction, because process data at different times contribute differently to fault classification, a Batch Normalization Long Short-Term Memory network based on the Self-Attention mechanism (BN-LSTM-SA) is used to extract from the process data the temporal features that most strongly affect fault classification. Since temporal and spatial features are important indicators for video classification, a Two-Stream Shifted Windows 3D Convolution Transformer (TSSCT) model is used to extract temporal and spatial features from the video data. A Cross Multi-head Self-Attention (CMSA) model is designed for feature fusion; it fully exploits the correlation and complementarity of the multi-source heterogeneous information and reasonably allocates the weights of the different modalities. Experiments show that the proposed method outperforms both unimodal models and existing multimodal fusion models, demonstrating its effectiveness.
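As a concrete illustration of the fusion step, the sketch below shows one plausible way to implement cross multi-head attention between the two modality streams in PyTorch. The layer sizes, the number of fault classes, the mean-pooling step, and the `CrossMultiHeadAttentionFusion` name are illustrative assumptions, not the paper's exact CMSA design.

```python
# A minimal sketch of cross multi-head attention fusion between a process-data
# feature stream and a video feature stream. Dimensions, head count, class
# count, and pooling are assumptions for illustration only.
import torch
import torch.nn as nn

class CrossMultiHeadAttentionFusion(nn.Module):
    def __init__(self, dim: int = 256, num_heads: int = 8, num_classes: int = 10):
        super().__init__()
        # Each stream queries the other, so the attention weights reflect
        # cross-modal relevance rather than within-modality similarity.
        self.proc_to_video = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.video_to_proc = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(2 * dim, num_classes)

    def forward(self, proc_feat: torch.Tensor, video_feat: torch.Tensor) -> torch.Tensor:
        # proc_feat:  (batch, T_p, dim) temporal features from the process-data branch
        # video_feat: (batch, T_v, dim) spatio-temporal features from the video branch
        p2v, _ = self.proc_to_video(query=proc_feat, key=video_feat, value=video_feat)
        v2p, _ = self.video_to_proc(query=video_feat, key=proc_feat, value=proc_feat)
        # Pool each attended stream over time, concatenate, and classify.
        fused = torch.cat([p2v.mean(dim=1), v2p.mean(dim=1)], dim=-1)
        return self.classifier(fused)

# Usage: 4 samples, 20 process-data steps, 16 video clip tokens -> (4, 10) logits.
fusion = CrossMultiHeadAttentionFusion()
logits = fusion(torch.randn(4, 20, 256), torch.randn(4, 16, 256))
```

Letting each modality attend to the other, rather than self-attending over a concatenation, is one way the weight allocation between modalities described in the abstract could be realized.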
