Factors that affect the accuracy of text-based language identification

Gerrit Reinier Botha, Etienne Barnard

doi:10.1016/j.csl.2012.01.004

Abstract

The classification accuracy of text-based language identification depends on several factors, including the size of the text fragment to be identified, the amount of training data available, the classification features and algorithm employed, and the similarity of the languages to be identified. To date, no systematic study of these factors and their interactions has been published. We therefore investigate the effects of each of these factors and their relations on the performance of text-based language identification.

Our study uses n-gram statistics as features for classification. In particular, we compare support vector machines, Naïve Bayesian and difference-in-frequency classifiers on different amounts of input text and various values of n for different amounts of training data. For a fixed value of n the support vector machines generally outperform the other classifiers, but the simpler classifiers are able to handle larger values of n. The additional computational complexity of training the support vector machine classifier may not be justified in light of the importance of using a large value of n, except possibly for small sizes of the input window when limited training data is available.

Our training and testing corpora consisted of text from the 11 official languages of South Africa. Within these languages distinct language families can be found. We find that it is much more difficult to discriminate languages within language families than languages in different families. The overall accuracy on short input strings is low for this reason, but for input strings of 100 characters or more there is only a slight confusion within families and accuracies as high as 99.4% are achieved. For the smallest input strings studied here, which consist of 15 characters, the best accuracy achieved is only 83%, but when languages in different families are grouped together, this corresponds to a usable 95.1% accuracy.

The relationship between the amount of training data and the accuracy achieved is found to depend on the window size: for the largest window (300 characters) about 400,000 characters are sufficient to achieve close-to-optimal accuracy, whereas improvements in accuracy are found even beyond 1.6 million characters of training data for smaller windows.

Our study concludes that the correlation between the factors studied significantly affects classification accuracy; therefore, to assure credible and comparable results, these factors need to be controlled in any text-based language identification task.
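The idea of classifying a text fragment from character n-gram statistics can be illustrated with a minimal sketch of a Naïve Bayesian classifier. This is not the authors' implementation: the function names (`char_ngrams`, `train`, `classify`), the two toy training sentences, the add-one smoothing and the uniform language priors are all assumptions made here for illustration only.

```python
import math
from collections import Counter

def char_ngrams(text, n):
    """Return all overlapping character n-grams of `text`."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train(samples, n):
    """Count character n-grams per language.

    `samples` maps a language label to its training text
    (hypothetical toy data in this sketch).
    """
    counts = {lang: Counter(char_ngrams(text, n))
              for lang, text in samples.items()}
    vocab = set()
    for c in counts.values():
        vocab.update(c)
    return {"n": n, "counts": counts, "vocab_size": len(vocab)}

def classify(model, fragment):
    """Return the language with the highest smoothed log-likelihood.

    Assumes uniform priors over languages and add-one smoothing;
    both are illustrative choices, not taken from the study.
    """
    n = model["n"]
    v = model["vocab_size"]
    best_lang, best_score = None, -math.inf
    for lang, c in model["counts"].items():
        total = sum(c.values())
        score = sum(math.log((c[g] + 1) / (total + v))
                    for g in char_ngrams(fragment, n))
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang

# Toy training text (assumed for illustration; far too small for real use).
TOY_SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat",
    "af": "die vinnige bruin jakkals spring oor die lui hond en die kat",
}

model = train(TOY_SAMPLES, n=3)
print(classify(model, "the dog"))
print(classify(model, "die hond"))
```

In this sketch the value of n and the length of the fragment passed to `classify` play the roles of the n-gram order and the input window size discussed in the abstract; with realistic amounts of training data, both could be varied to observe their effect on accuracy.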

Journal: Computer Speech & Language
Publication Date: Jan 16, 2012
Citations: 48

