On Hierarchical Text Language-Identification Algorithms

Maimaitiyiming Hasimu, Wushour Silamu

doi:10.3390/a11040039

Abstract

Text on the Internet is written in different languages and scripts that can be divided into different language groups. Most of the errors in language identification occur with similar languages. To improve the performance of short-text language identification, we propose four different levels of hierarchical language identification methods and conduct comparative tests in this paper. The algorithms were evaluated on sentences from 97 languages, and the macro-averaged F1-score reached with four-stage language identification was 0.9799. The experimental results verified that, with script identification, language group identification, and similar language group identification added as successive stages, the performance of the language identification algorithm improved with each stage. Notably, the language identification accuracy between similar languages improved substantially. We also investigated how foreign content in a language affects language identification.
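The staged idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy hierarchy, the marker-word sets, and the names `HIERARCHY`, `vocabulary`, `detect_script`, and `identify` are all assumptions made for illustration, and word overlap stands in for whatever classifier is actually used at each stage. It has three levels rather than the paper's four, but the narrowing pattern is the same: each stage only chooses among the children of the label picked by the previous stage.

```python
from typing import List

# Toy hierarchy (hypothetical): script -> language group -> language -> marker words.
HIERARCHY = {
    "Latin": {
        "Germanic": {"en": {"the", "and", "is"}, "de": {"der", "und", "ist"}},
        "Romance": {"es": {"el", "los", "muy"}, "pt": {"os", "não", "muito"}},
    },
    "Cyrillic": {
        "East Slavic": {"ru": {"это", "что", "или"}, "uk": {"це", "що", "та"}},
    },
}


def vocabulary(node) -> set:
    """Union of the marker words of every language below a node."""
    if isinstance(node, set):
        return node
    vocab = set()
    for child in node.values():
        vocab |= vocabulary(child)
    return vocab


def detect_script(text: str) -> str:
    """Stage 1 (crude): any character in U+0400..U+04FF means Cyrillic."""
    is_cyrillic = any("\u0400" <= ch <= "\u04ff" for ch in text)
    return "Cyrillic" if is_cyrillic else "Latin"


def identify(text: str) -> List[str]:
    """Walk the hierarchy, choosing the best-overlapping child at each level."""
    tokens = set(text.lower().split())
    path = [detect_script(text)]
    node = HIERARCHY[path[0]]
    while isinstance(node, dict):
        # Ties are broken by insertion order, which is arbitrary in this toy.
        best = max(node, key=lambda label: len(tokens & vocabulary(node[label])))
        path.append(best)
        node = node[best]
    return path
```

For example, `identify("der Hund ist klein und alt")` walks Latin, then Germanic, then `de`. The final decision only has to separate the few languages left in the selected group, which is where confusions between similar languages are concentrated.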

Highlights

Language identification (LI) is generally viewed as a form of text categorization.
Polyglot can identify more than 196 languages, but we were unable to access all the language sources in Polyglot from currently available corpora sets. langid.py was pre-trained on 97 languages, all of which can be accessed in the Leipzig Corpora Collection. langid.py is superior to most other open-source LI tools in terms of short-text LI accuracy [36].
We found that the performance of naïve Bayes (NB) in LI in similar language groups (SLGs)

Summary

Introduction

Language identification (LI) is a process that attempts to classify text in a language into a pre-defined set of known languages [1]. Although LI is often considered a solved problem, studies have verified that LI accuracy drops rapidly when identifying short text [5,6,7], and confusion errors often occur between languages in the same family or in similar language groups [3,4,8]. Languages are written in different scripts, and each script has a unique defined code range in Unicode. This helps identify different parts of a script within a document [9]. Languages belong to different families, and language families can be divided into similar phylogenetic units [10].
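As a rough illustration of script identification from Unicode, the sketch below counts letters per script using only the Python standard library. It is an approximation, not the method from the paper: the standard library does not expose the Unicode Script property, so the first word of each character's Unicode name (for example `LATIN`, `CYRILLIC`, `ARABIC`) is used as a stand-in, and the function names `script_counts` and `dominant_script` are hypothetical.

```python
import unicodedata
from collections import Counter
from typing import Optional


def script_counts(text: str) -> Counter:
    """Count alphabetic characters per script, approximated by name prefix."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # skip digits, punctuation, and whitespace
        name = unicodedata.name(ch, "")
        if name:
            # e.g. "CYRILLIC SMALL LETTER EM" -> "CYRILLIC"
            counts[name.split()[0]] += 1
    return counts


def dominant_script(text: str) -> Optional[str]:
    """Return the most frequent script label, or None if no letters are found."""
    counts = script_counts(text)
    return counts.most_common(1)[0][0] if counts else None
```

Calling `script_counts("Hello мир")` reports five `LATIN` letters and three `CYRILLIC` letters, so the same counts can also flag mixed-script input instead of forcing a single label.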

Methods

Results

Conclusion

Journal: Algorithms	Publication Date: Mar 27, 2018
Citations: 2	License type: CC BY 4.0

Similar Papers

Optimizing n‑gram Order of an n‑gram Based Language Identification Algorithm for 68 Written Languages
Chew Y Choong ... Yoshiki Mikami
International Journal on Advances in ICT for Emerging Regions (ICTer) | VOL. 2
08 Dec 2009

Multilingual Speech Corpus in Low-Resource Eastern and Northeastern Indian Languages for Speaker and Language Identification
Joyanta Basu ... Tapan Kumar Basu
Circuits, Systems, and Signal Processing | VOL. 40
20 Apr 2021

Unsupervised Deep Language and Dialect Identification for Short Texts
Koustava Goswami ... Theodorus Fransen
01 Jan 2020

Performance Enhancement of Indian LID System Using Spectral Processing
Phani Kumar Polasi ... Thota Ramyasri
Indian Journal of Science and Technology | VOL. 9
30 Dec 2016