The Design of a Script Identification Algorithm and Its Application in Constructing a Text Language Identification Dataset

Mamtimin Qasim, Wushour Silamu, Minghui Qiu

doi:10.3390/data9110134

Abstract

Script identification (SI) is easier to implement than language identification, and its identification rate is very high. A language identification algorithm also achieves a higher identification rate when it has fewer candidate languages to distinguish. However, no systematic study has addressed SI across multiple languages or how to construct the corresponding language identification datasets. In this paper, we therefore design a script identification algorithm and discuss the construction of a language identification dataset based on script groups. The data sources are text corpora for 261 languages from the Leipzig Corpora Collection, grouped into 23 script groups. In the Unicode encoding scheme, different scripts are assigned to different code regions. Based on this feature, we propose a script identification algorithm based on regular expression matching, whose micro F-score reaches 0.9929 in sentence-level script identification experiments. To reduce noise when constructing the language identification dataset for each script, the script identification algorithm is used to filter out other-script content in each text.
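
As a rough illustration of the regular-expression approach described in the abstract, the Python sketch below counts how many characters of a sentence fall into each script's Unicode code region and reports the script with the most matches. The six scripts, their code ranges, and the function names `identify_script` and `filter_sentences` are assumptions chosen for this example; they are not the authors' 23 script groups or their implementation, and the ranges cover only the main block of each script.

```python
import re
from collections import Counter

# Hypothetical, deliberately small table: one character class per script,
# built from well-known Unicode block ranges. A real system would need
# many more scripts and supplementary blocks.
SCRIPT_PATTERNS = {
    "Latin": re.compile(r"[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]"),
    "Greek": re.compile(r"[\u0370-\u03FF]"),
    "Cyrillic": re.compile(r"[\u0400-\u04FF]"),
    "Arabic": re.compile(r"[\u0600-\u06FF]"),
    "Devanagari": re.compile(r"[\u0900-\u097F]"),
    "Han": re.compile(r"[\u4E00-\u9FFF]"),
}


def count_scripts(text):
    """Count the characters of `text` that match each script's code region."""
    return Counter(
        {name: len(pattern.findall(text)) for name, pattern in SCRIPT_PATTERNS.items()}
    )


def identify_script(text):
    """Return the script with the most matching characters, or None if none match."""
    name, hits = count_scripts(text).most_common(1)[0]
    return name if hits > 0 else None


def filter_sentences(sentences, expected_script):
    """Keep only the sentences whose dominant script is `expected_script`."""
    return [s for s in sentences if identify_script(s) == expected_script]
```

Under these assumptions, `identify_script("Привет мир")` yields `"Cyrillic"`, and `filter_sentences` could serve as a simple noise filter when assembling a per-script corpus. Digits, punctuation, and whitespace are ignored because they belong to no script-specific region in this table.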

Journal: Data
Publication Date: Nov 11, 2024
License type: CC BY 4.0
