With the exponential growth of file data in the multimedia era, file retrieval as a means of effective data management has become an active research field. To address the need for English file search, this paper proposes a transformer-based English multimodal file search model. Ablation experiments on two public datasets and comparison experiments with the benchmark model verify the effectiveness and superiority of the proposed transformer model in multimodal data processing. A multimodal fusion retrieval system can usually achieve better performance than a single-modal retrieval system. The experiments focus on three modalities: audio, image, and text. The experimental results show that the proposed method not only improves the efficiency of file search, but also extracts modal features and performs feature fusion more effectively. Future work could explore other types of attention mechanisms or integrate a variety of architectures to further enhance the feasibility and superiority of multimodal file search.