Zero-shot text classification targets tasks for which no annotated training data are available and therefore does not require large amounts of labeled data. Existing approaches follow either a text-text matching paradigm or a text-image matching paradigm, both of which perform well on different benchmark datasets. However, these paradigms consider only a single modality for matching, and little attention has been paid to how multimodal information can aid text classification. To incorporate multimodality into zero-shot text classification, we propose a multimodal enhanced CLIP framework (CLIPMulti), which employs a text-image&text matching paradigm to improve the effectiveness of zero-shot text classification. We test three different image and text combinations for their effect on zero-shot text classification, and further propose a matching method (Match-CLIPMulti) that automatically finds the corresponding text based on the classified images. We conduct experiments on seven publicly available zero-shot text classification datasets and achieve competitive performance. In addition, we analyze the effect of different parameters in the Match-CLIPMulti experiments. We hope this work will inspire further thought and exploration of multimodal fusion in language tasks.
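The sketch below illustrates the general idea of CLIP-based zero-shot text classification with combined text and image class representations, not the authors' actual CLIPMulti or Match-CLIPMulti implementation. It assumes the Hugging Face Transformers CLIP API (CLIPModel, CLIPProcessor); the prompt format, the mixing weight `alpha`, and the function names are illustrative assumptions.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch only: embed the input text and per-class prompts (optionally
# fused with per-class images) with CLIP, then pick the most similar class.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def embed_text(texts):
    # CLIP's text encoder has a 77-token limit, so long documents are truncated.
    inputs = processor(text=texts, return_tensors="pt", padding=True, truncation=True)
    with torch.no_grad():
        feats = model.get_text_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def embed_images(images):
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

def classify(document, label_prompts, label_images=None, alpha=0.5):
    """Return the index of the best-matching class.

    label_prompts: one textual prompt per class, e.g. "a news article about sports".
    label_images:  optional list with one PIL image per class.
    alpha:         hypothetical weight mixing text and image class embeddings.
    """
    doc_emb = embed_text([document])          # (1, d)
    class_emb = embed_text(label_prompts)     # (C, d)
    if label_images is not None:
        img_emb = embed_images(label_images)  # (C, d)
        class_emb = torch.nn.functional.normalize(
            alpha * class_emb + (1 - alpha) * img_emb, dim=-1)
    scores = doc_emb @ class_emb.T            # cosine similarities
    return scores.argmax(dim=-1).item()
```

In this simplified view, the text-text paradigm corresponds to calling `classify` without images, while passing class images mixes both modalities into a single class representation; the paper's actual combination and matching strategies may differ.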