Abstract

Text on the Internet is written in different languages and scripts that can be divided into different language groups. Most errors in language identification occur between similar languages. To improve the performance of short-text language identification, we propose four levels of hierarchical language identification methods and compare them experimentally. The algorithms were evaluated on sentences from 97 languages, and the four-stage language identification reached a macro-averaged F1-score of 0.9799. The experimental results verified that the performance of the language identification algorithm improved with each added stage: script identification, language group identification, and similar language group identification. Notably, the language identification accuracy between similar languages improved substantially. We also investigated how foreign content in a language affects language identification.
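For readers unfamiliar with the metric reported above, the following is a minimal sketch of how a macro-averaged F1-score can be computed: F1 is calculated per language and then averaged with equal weight, so small languages count as much as large ones. The function name `macro_f1` and the toy labels are illustrative assumptions, not taken from the paper, and the paper's exact averaging convention (for example, which label set is averaged over) may differ.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over all labels seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for label g
        else:
            fp[p] += 1          # p was predicted but wrong
            fn[g] += 1          # g was missed
    labels = sorted(set(gold) | set(pred))
    scores = []
    for lab in labels:
        denom = 2 * tp[lab] + fp[lab] + fn[lab]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs
        scores.append(2 * tp[lab] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Toy example with hypothetical language codes
gold = ["en", "en", "de", "de"]
pred = ["en", "de", "de", "de"]
score = macro_f1(gold, pred)
```

In this toy case, `en` has an F1 of 2/3 and `de` an F1 of 0.8, so the macro average is their plain mean.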

Highlights

  • Language identification (LI) is generally viewed as a form of text categorization

  • Polyglot can identify more than 196 languages, but we were unable to access all the language sources in Polyglot from currently available corpora sets. langid.py was pre-trained on 97 languages, all of which can be accessed in the Leipzig Corpora Collection. langid.py is superior to most other open-source LI tools in terms of short-text LI accuracy [36]

  • We found that the performance of naïve Bayes (NB) in LI in similar language groups (SLGs)


Summary

Introduction

Language identification (LI) is a process that attempts to classify text in a language into a pre-defined set of known languages [1]. Although LI is often considered a solved problem, studies have verified that LI accuracy drops rapidly when identifying short text [5,6,7], and confusion errors often occur between languages in the same family or in similar language groups [3,4,8]. Languages are written in different scripts, and each script has a unique defined code range in Unicode. This helps identify different parts of a script within a document [9]. Languages belong to different families, and language families can be divided into similar phylogenetic units [10].
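The idea that Unicode code ranges can separate scripts can be illustrated with a small sketch. The range table below covers only a handful of well-known Unicode blocks and is an illustrative assumption; the names `SCRIPT_RANGES`, `script_of`, `script_counts`, and `dominant_script` are hypothetical and do not come from the paper, whose script identification stage may work differently.

```python
from collections import Counter

# A few Unicode block ranges (inclusive); deliberately incomplete.
SCRIPT_RANGES = [
    (0x0041, 0x024F, "Latin"),       # Basic Latin letters through Latin Extended-B
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0900, 0x097F, "Devanagari"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),         # CJK Unified Ideographs
    (0xAC00, 0xD7AF, "Hangul"),      # Hangul Syllables
]

def script_of(ch):
    """Return the script name for one character, or 'Other' if unlisted."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return "Other"

def script_counts(text):
    """Count alphabetic characters per script; digits and punctuation are skipped."""
    return Counter(script_of(ch) for ch in text if ch.isalpha())

def dominant_script(text):
    """Return the most frequent script in the text, or 'Other' if no letters."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else "Other"

mixed = script_counts("Привет, world")   # Cyrillic and Latin letters in one string
```

Counting per script rather than returning a single label also exposes mixed-script input, which is one simple way foreign content inside a sentence can be detected before any language-level classifier runs.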

