This study explores the application of multimodal data mining to text and image vector processing, aiming to deepen and broaden data analysis by integrating information from different data types. We use the FairFace dataset together with the CLIP model's encoding layers to obtain text and image vectors, and apply the K-Means clustering algorithm to reduce vector dimensionality. We then introduce a bipartite graph matching algorithm to find the maximum matching between text vectors and image vectors, and compute both a contrastive learning loss and a similarity loss. The pipeline covers data preparation, feature extraction, vector dimensionality reduction, matching, and loss evaluation, forming a complete text-image matching workflow. Our contributions include using the K-Means clustering algorithm for vector dimensionality reduction, and introducing a bipartite graph matching algorithm with two loss measures for text-image vector matching, further improving matching quality.
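The pipeline above can be sketched end to end. This is a minimal, illustrative implementation under several assumptions: random vectors stand in for the actual CLIP text and image embeddings, K-Means "dimensionality reduction" is interpreted as re-encoding each vector by its distances to the cluster centroids, and maximum bipartite matching is realized with SciPy's Hungarian solver on a cosine-similarity matrix. None of these choices are confirmed by the abstract as the authors' exact method.

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.optimize import linear_sum_assignment

rng = np.random.default_rng(0)
n, d, k = 32, 512, 8  # hypothetical: 32 pairs, 512-dim embeddings, 8 clusters

# Stand-ins for CLIP text and image embeddings (random data, not real CLIP output).
text_vecs = rng.normal(size=(n, d))
image_vecs = rng.normal(size=(n, d))

# K-Means "dimensionality reduction" (one interpretation): fit k centroids on
# all vectors, then represent each vector by its k distances to the centroids.
km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(
    np.vstack([text_vecs, image_vecs]))
text_red = km.transform(text_vecs)    # shape (n, k)
image_red = km.transform(image_vecs)  # shape (n, k)

def normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Cosine-similarity matrix between reduced text and image vectors.
sim = normalize(text_red) @ normalize(image_red).T   # (n, n)

# Maximum bipartite matching: Hungarian algorithm on negated similarity
# gives the assignment maximizing total matched similarity.
row, col = linear_sum_assignment(-sim)

# Similarity loss: one minus the mean similarity of matched pairs.
similarity_loss = 1.0 - sim[row, col].mean()

# Contrastive (InfoNCE-style) loss: for each text, its matched image is the
# positive and all other images are negatives.
tau = 0.07  # temperature, an assumed value
logits = sim / tau
log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
contrastive_loss = -log_probs[row, col].mean()

print(similarity_loss, contrastive_loss)
```

In a real run, `text_vecs` and `image_vecs` would come from CLIP's text and image encoders applied to FairFace captions and face images; the matching and loss computation stay the same.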