Research on Automatic identification of Chinese minority language websites

Haifeng Liu, Jing Li, Yuanyuan Yang, Zhiqiang Han

doi:10.2991/icismme-15.2015.175

Abstract

This paper presents the features of Chinese minority language text collections on websites, analyses the problems of identifying webpages written in Chinese minority languages, and proposes three automatic identification methods. Based on these methods, software is designed and implemented to identify text in Chinese minority languages and scripts such as Mongolian, Tibetan, Uyghur, Kazak, Kirgiz, Yi script, Tai Lue script, Korean, Russian, Zhuang script and so on.

Introduction and Related works

It is generally thought that text identification began with Mustonen (1965) [1], who proposed identifying different texts according to the characters of various languages. Early research relied mainly on language rules; with the development of computers, methods of natural language identification shifted from analysing language rules to statistical analysis. Cavnar (1994) [2] presented N-gram automatic text identification, a classical method based on statistical analysis. Cavnar used N-grams to test 3478 texts in 8 languages, and the rate of correct identification reached 99.8%. In the same year, Dunning reached 99.9% by combining a Markov model with N-grams. After that, some scholars applied statistical algorithms such as relative entropy [3] and SVM [4] to text identification, and used techniques such as smoothing to raise the identification rate to 99.998% [5]. The growing number of webpages in various languages has led scholars to study multi-text identification between different language families and within the same language family [6][7][8][9], and the correct identification rate is rising. In China, methods combining rules and statistics have been used to identify Tibetan [10][11], Mongolian [12] and Uyghur texts [13][14], with correct identification rates reaching 100%, 80% and 97%, respectively. Research on identifying texts in other minority languages is scarcer.

Features of minority websites identification

Present problems

Compared with Chinese and English characters, minority language characters have distinctive features. For historical reasons, some minority languages are written in several different scripts. Computer technology for processing minority language characters lags behind. Half of the minority websites are folk websites whose source code is non-standard, and the encoding of some minority language characters is not unified, so different encodings of the same language are incompatible. The present problems of automatic identification stem mainly from the features of the minority languages themselves and the immaturity of the supporting technology.

Features of minority websites

(1) The same language has different scripts, for example Mongolian (Traditional Mongolian, Tod Mongolian, New Mongolian), Uyghur (Arabic script, Latin script), Kazak (Arabic script, Latin script, Cyrillic script) and Tai Lue (New Tai Lue, Old Tai Lue).

(2) The same script has different encodings. Tibetan and traditional Mongolian have the most encoding forms. Tibetan has Unicode, Founder, ASCII (11 forms), HuaGuang, Tibetan University, Tonguer, Pandita and so on. Traditional Mongolian has Unicode, Menk, Hussein, Founder, Minggatu, Oyuta, Burigude and so on.

(3) Different encodings of the same script have intersecting and overlapping code regions. In Tibetan, some of the GB2312-based encodings overlap; in Mongolian, some of the Unicode-based encodings overlap.

International Conference on Information Sciences, Machinery, Materials and Energy (ICISMME 2015). © 2015. The authors. Published by Atlantis Press.

Table 1. Part of GB2312-based Tibetan encoding

Encoding                      First byte scope       Trail byte             Syllable point encoding
Founder DOS                   0xC0-0xEE              0x21-0x7E              0xC032
Founder Windows               0xAA-0xAC, 0xB0-0xDE   0xA0-0xFE              0xAAAC
HuaGuang DOS                  0xB0-0xFB              0x21-0x7E              0xE162
HuaGuang Windows              0xB0-0xFB              0xA1-0xFE              0xE1E2
Tonguer encoding              0x81-0xEE, 0xF5        0x21-0x7E, 0x40-0xFE   0xA6E6
Tibetan University encoding   0xAA-0xAF, 0xF8-0xFB   0xA1-0xFE              0xFABB

Table 2. Part of Unicode-based Mongolian encoding

Encoding                      Encoding scope
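
The overlap described in point (3) can be checked directly against the byte ranges of Table 1. The following is a minimal sketch, not the authors' identification method. It assumes the ranges as printed in the table, with the fused values "0xAA0xAC" and "0x210x7E" read as the ranges 0xAA-0xAC and 0x21-0x7E. The names `TABLE1_RANGES`, `in_ranges` and `candidate_encodings` are illustrative and do not come from the paper.

```python
# Inclusive (low, high) byte ranges transcribed from Table 1.
# Assumption: "0xAA0xAC" and "0x210x7E" in the source are read as
# 0xAA-0xAC and 0x21-0x7E (a hyphen appears to have been lost).
TABLE1_RANGES = {
    "Founder DOS": ([(0xC0, 0xEE)], [(0x21, 0x7E)]),
    "Founder Windows": ([(0xAA, 0xAC), (0xB0, 0xDE)], [(0xA0, 0xFE)]),
    "HuaGuang DOS": ([(0xB0, 0xFB)], [(0x21, 0x7E)]),
    "HuaGuang Windows": ([(0xB0, 0xFB)], [(0xA1, 0xFE)]),
    "Tonguer": ([(0x81, 0xEE), (0xF5, 0xF5)], [(0x21, 0x7E), (0x40, 0xFE)]),
    "Tibetan University": ([(0xAA, 0xAF), (0xF8, 0xFB)], [(0xA1, 0xFE)]),
}


def in_ranges(value, ranges):
    """Return True if value lies in any of the inclusive (low, high) ranges."""
    return any(low <= value <= high for low, high in ranges)


def candidate_encodings(code):
    """List every Table 1 encoding whose byte ranges admit a two-byte code."""
    first, trail = code >> 8, code & 0xFF
    return [
        name
        for name, (first_ranges, trail_ranges) in TABLE1_RANGES.items()
        if in_ranges(first, first_ranges) and in_ranges(trail, trail_ranges)
    ]


if __name__ == "__main__":
    # Syllable point codes from the last column of Table 1.
    for code in (0xC032, 0xAAAC, 0xE162, 0xE1E2, 0xA6E6, 0xFABB):
        print(hex(code), candidate_encodings(code))
```

Under these assumptions, the Founder DOS syllable point code 0xC032 also falls inside the HuaGuang DOS and Tonguer ranges, so a byte-range test alone cannot decide which encoding a page uses.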


Publication Date: Jan 1, 2015
Citations: 1	License type: cc-by-nc
