In today's information age, information is gathered not only from text but also from more complex media such as images, audio, and video. Among these sources, video has grown so rapidly that it is gradually becoming the main source of information in people's lives. Video data are characterized by diverse content, complex forms, and a low degree of structure, so classifying, managing, and retrieving them effectively remains a difficult problem. In this paper, an improved crow search algorithm is used to process video images, and information entropy is used to extract the key frames. This reduces the computational burden of calculating and comparing features for every frame and thus shortens the key frame detection time. The extracted feature sets are then used as input to a hidden Markov model (HMM), together with the observation sequence $$O = O_{1}, O_{2}, O_{3}, \ldots, O_{T}$$ of the input image or video data and the initial model parameters $$\lambda = (\pi, A, B)$$. According to the training rules, the model parameters are repeatedly adjusted, and a new model $$\overline{\lambda}$$ is constructed step by step to realize the retrieval of multimedia images and videos. The experimental results show that the method has clear advantages in retrieval time and retrieval effectiveness and provides new ideas for multimedia image and video retrieval.
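
As a rough illustration of entropy-based key frame extraction, the following minimal sketch computes the Shannon entropy of each frame's gray-level histogram and keeps a frame as a key frame when its entropy differs from that of the previous key frame by more than a threshold. This is not the improved crow search algorithm of the paper, and the abstract does not specify how entropy is applied. The function names `frame_entropy` and `select_key_frames`, the flat list-of-gray-levels frame representation, and the `threshold` value are all assumptions made for illustration only.

```python
import math
from collections import Counter
from typing import List, Sequence


def frame_entropy(pixels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of the gray-level histogram of one frame.

    `pixels` is a flat sequence of gray levels (an assumed representation).
    """
    counts = Counter(pixels)
    total = len(pixels)
    # H = -sum(p * log2(p)) over the gray levels that actually occur.
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def select_key_frames(
    frames: Sequence[Sequence[int]], threshold: float = 0.5
) -> List[int]:
    """Return indices of frames whose entropy differs from the entropy of
    the most recent key frame by more than `threshold` (hypothetical rule)."""
    if not frames:
        return []
    key_indices = [0]  # the first frame is always kept as a key frame
    last_entropy = frame_entropy(frames[0])
    for i in range(1, len(frames)):
        h = frame_entropy(frames[i])
        if abs(h - last_entropy) > threshold:
            key_indices.append(i)
            last_entropy = h
    return key_indices


if __name__ == "__main__":
    # Five tiny 4-pixel "frames" standing in for real video frames.
    frames = [
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 1, 2, 3],
        [0, 1, 2, 3],
        [0, 0, 1, 1],
    ]
    print(select_key_frames(frames))
```

Because only one scalar per frame is compared, later feature extraction and matching need to run only on the selected frames, which is the kind of saving the abstract attributes to key frame extraction.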