Abstract

The rapid development of language science and computing technology, especially the spread of broadband Internet, has made news in every language circulate faster and faster. Among multimodal news formats such as text, image, audio, and video, text still accounts for the largest share of Internet news. With more than 7,000 living human languages, efficiently identifying the language of a news text is a fundamental natural language processing task: it determines which language-specific methods to apply in subsequent in-depth content processing and online public opinion analysis. Based on the N-gram idea, we designed and implemented a language identification method for all-language Internet news, consisting of two stages, language training and language identification, and applied it to real-world text news preprocessing. Results on all-language Internet news show that our method achieves good identification accuracy and efficiency.
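
The abstract does not describe the implementation, so the following is only a minimal sketch of how the two stages (training and identification) might look. It assumes a rank-ordered character n-gram profile classifier with an out-of-place distance, a common N-gram technique that is not necessarily the one used in the paper. All names (`ngram_profile`, `train`, `identify`) and the parameters `n_max` and `top_k` are hypothetical.

```python
from collections import Counter


def ngram_profile(text, n_max=3, top_k=300):
    """Map the top_k most frequent character n-grams (n = 1..n_max) to their rank."""
    counts = Counter()
    # Lowercase, collapse whitespace, and pad so word boundaries form n-grams too.
    padded = " " + " ".join(text.lower().split()) + " "
    for n in range(1, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    # Sort by descending count, then by n-gram, so the ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {gram: rank for rank, (gram, _) in enumerate(ranked[:top_k])}


def train(samples, n_max=3, top_k=300):
    """Training stage: build one ranked profile per language from {language: text}."""
    return {lang: ngram_profile(text, n_max, top_k) for lang, text in samples.items()}


def out_of_place(doc_profile, lang_profile, penalty):
    """Sum of rank differences; n-grams absent from the language profile cost `penalty`."""
    return sum(
        abs(rank - lang_profile[gram]) if gram in lang_profile else penalty
        for gram, rank in doc_profile.items()
    )


def identify(text, profiles, n_max=3, top_k=300):
    """Identification stage: return the language whose profile is closest to the text."""
    doc = ngram_profile(text, n_max, top_k)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang], top_k))


if __name__ == "__main__":
    # Toy training data, assumed for illustration; a real system needs large corpora.
    samples = {
        "en": "the quick brown fox jumps over the lazy dog and the news spreads",
        "es": "el rápido zorro marrón salta sobre el perro perezoso y las noticias",
    }
    profiles = train(samples)
    print(identify("the news of the day", profiles))
```

Ranked profiles keep the per-language model small and make identification a cheap comparison, but short inputs and closely related languages are the usual weak points of this kind of approach.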
