ABSTRACT Text-to-video retrieval systems enable simple, natural video search, but existing methods often rely on single queries with noisy annotations and low-quality descriptions. To provide high-quality descriptions and clean annotations, a novel lightweight hashing framework with a Contrastive Self-Cross XLNet and combinational similarity matching is proposed. Shortcomings of existing text- and video-processing approaches, such as redundant keyword generation and frame extraction that ignores optical-flow estimation of keyframes, are addressed through novel Textual Labelling and Automated Keyframe Segmentation modules. A 3D Cuboid Net with Lightweight Hashing is proposed, which selects features and converts them into hash codes using Lightweight LSS-Net based hashing, reducing memory utilization. Additionally, a novel Combinational Attentive Text-Video Retrieval scheme is proposed, which eliminates error surfaces with multiple narrow extrema and increases retrieval validity. The model solves video retrieval problems efficiently, achieving high accuracy, recall, and sensitivity; it thereby addresses the limitations of existing text-to-video retrieval methods and improves the overall quality of the retrieval process.