Efficient Language Identification for All-Language Internet News

Jian Tang, Xiaojiang Chen, Wuying Liu

doi:10.1109/ialp54817.2021.9675270

Abstract

The rapid development of language science and computing technology, especially the popularization of broadband Internet, has led to an explosion of news in all languages that spreads and circulates ever faster. Among multi-modal news such as text, image, audio, and video, text news still accounts for the largest proportion of Internet news. With more than 7,000 existing human languages, efficiently identifying the language of text news has become the most basic natural language processing step, because it allows the appropriate language processing methods to be selected for subsequent in-depth content processing and online public opinion analysis. Based on the idea of N-Gram, we designed and implemented a language identification method suitable for all-language Internet news, covering two aspects, language training and language identification, and applied it to the preprocessing of real text news. Language identification results on all-language Internet news show that our method achieves good recognition accuracy and efficiency.
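As an illustration only, the training and identification steps mentioned in the abstract can be sketched with a classic rank-based character n-gram profile approach (in the style of Cavnar and Trenkle's out-of-place measure). The abstract does not state the n-gram order, the scoring function, or the data actually used, so the function names, `n=3`, `top_k=300`, and the sample format below are assumptions rather than the authors' implementation.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, space-padded text."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(samples, n=3, top_k=300):
    """Language training: build a ranked n-gram profile for each language.

    `samples` maps a language code to a list of training texts
    (a hypothetical input format chosen for this sketch).
    """
    profiles = {}
    for lang, texts in samples.items():
        counts = Counter()
        for text in texts:
            counts.update(char_ngrams(text, n))
        ranked = [gram for gram, _ in counts.most_common(top_k)]
        # Map each frequent n-gram to its rank (0 = most frequent).
        profiles[lang] = {gram: rank for rank, gram in enumerate(ranked)}
    return profiles


def identify(text, profiles, n=3, top_k=300):
    """Language identification: return the language with the smallest rank distance."""
    counts = Counter(char_ngrams(text, n))
    ranked = [gram for gram, _ in counts.most_common(top_k)]

    def distance(profile):
        # An n-gram missing from the profile gets the maximum penalty top_k.
        return sum(
            abs(rank - profile[gram]) if gram in profile else top_k
            for rank, gram in enumerate(ranked)
        )

    return min(profiles, key=lambda lang: distance(profiles[lang]))
```

In this sketch, `train` would be called once on labeled text per language, and `identify` would then be called on each incoming news text. A production system covering thousands of languages would likely need far larger profiles and a more careful scoring scheme than this toy distance.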

Full Text