Websites frequently use text-based captcha images to distinguish human users from automated programs. Previous research mainly focuses on different training strategies and neglects the characteristics of the text-based captcha images themselves, resulting in low accuracy. For text-based captcha images characterized by rotation, distortion, and non-character elements, we propose an end-to-end attack using a Transformer-based method with triplet deep attention (TDA). Firstly, the features of text-based captchas are extracted using ResNet45 with a TDA module and a Transformer encoder. The TDA module learns the rotation and distortion features of characters. Subsequently, the query, key, and value are designed based on the self-attention mechanism, and a query enhancement module is adopted to enhance the query. The query enhancement module strengthens character localization and reduces attention drift towards non-character elements. Finally, the feature maps are transformed into character probabilities for the final text recognition. Experiments are conducted on captcha datasets based on Roman characters from 9 popular websites, achieving an average word accuracy of 91.14%. To evaluate the performance of our method on small training sets, experiments are conducted on different scales of training data. Additionally, we apply the method to Chinese text-based captcha tasks and achieve an average word accuracy of 99.60%. The effectiveness of the method is also explored under low-illumination conditions and on scene text recognition, where background interference is present.
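A minimal sketch of the recognition pipeline described above, assuming a PyTorch implementation: the backbone here is a small CNN standing in for ResNet45 with the TDA module, and the query enhancement is a hypothetical MLP placeholder, since the abstract does not specify either design. It only illustrates the flow from image features through the Transformer encoder and enhanced-query cross-attention to per-character probabilities.

```python
import torch
import torch.nn as nn


class CaptchaRecognizer(nn.Module):
    def __init__(self, num_classes, max_chars=8, d_model=256):
        super().__init__()
        # Stand-in for ResNet45 + TDA: extracts a feature map from the captcha image.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Transformer encoder over the flattened feature map.
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=3)
        # Learned character queries; the enhancement module is a hypothetical MLP here.
        self.queries = nn.Parameter(torch.randn(max_chars, d_model))
        self.query_enhance = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(),
                                           nn.Linear(d_model, d_model))
        # Cross-attention: enhanced queries attend to encoded image features (key/value).
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Project each attended query to per-character class logits.
        self.classifier = nn.Linear(d_model, num_classes)

    def forward(self, images):                      # images: (B, 3, H, W)
        feats = self.backbone(images)               # (B, C, H', W')
        b, c, h, w = feats.shape
        feats = feats.flatten(2).transpose(1, 2)    # (B, H'*W', C) token sequence
        memory = self.encoder(feats)                # encoded image features
        q = self.query_enhance(self.queries).unsqueeze(0).expand(b, -1, -1)
        attended, _ = self.cross_attn(q, memory, memory)
        return self.classifier(attended)            # (B, max_chars, num_classes)


# Usage: logits over each character position; argmax gives the predicted string.
model = CaptchaRecognizer(num_classes=63)           # e.g. 62 alphanumerics + blank
logits = model(torch.randn(2, 3, 64, 200))
print(logits.shape)                                 # torch.Size([2, 8, 63])
```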