Feature Extraction Method of Machine Translation Equivalent Pairs in Chinese-English Comparable Corpus based OCR Recognition

Bo Wang

doi:10.1109/icoei51242.2021.9452871

Abstract

With the development of natural language processing and text mining technology, mining and extracting knowledge from unstructured text has become a trend. A contrastive corpus consists of two or more corpora composed of texts in different languages or in different variants of the same language. Comparable (analogical) corpora can be further subdivided into monolingual and bilingual/multilingual corpora. The former collects texts with similar content in a similar language environment, while the latter collects texts in different languages with similar content, register and communicative environment, and is mostly used in contrastive linguistics. Optical character recognition (OCR) is now mainly used in document recognition and certificate recognition. Deep learning can broaden the application scope of OCR. Applying text region extraction to OCR can improve the accuracy of OCR text extraction and of recognition overall. This paper studies a feature extraction method for machine translation equivalent pairs in OCR recognition, based on a Chinese-English comparable corpus.

Full Text