Abstract
This paper presents the features of Chinese minority language text collections on websites, analyzes the problems of identifying Chinese minority language text on webpages, and proposes three automatic identification methods. Based on these methods, we design and implement software to identify Chinese minority language texts such as Mongolian, Tibetan, Uyghur, Kazak, Kirgiz, Yi script, Tai Lue script, Korean, Russian, Zhuang script, and so on.

Introduction and Related works

It is generally thought that text identification began with Mustonen (1965) [1], who proposed identifying different texts according to the characters of various languages. Early research relied mainly on language rules; with the development of computers, methods of natural language identification shifted from analyzing language rules to statistical analysis. Cavnar (1994) [2] presented automatic text identification based on N-grams, a classical statistical method. Cavnar tested the N-gram method on 3478 texts in 8 languages and reached a correct identification rate of 99.8%. In the same year, Dunning reached 99.9% by combining a Markov model with N-grams. After that, scholars applied statistical algorithms such as relative entropy [3] and SVM [4] to text identification, and used techniques such as smoothing to raise the identification rate to 99.998% [5]. The growing number of webpages in various languages prompted scholars to study multi-language text identification, both between different language families and within the same language family [6][7][8][9], and the correct identification rate continues to rise. In China, methods combining rules and statistics have been used to identify Tibetan [10][11], Mongolian [12], and Uyghur texts [13][14], with correct identification rates reaching 100%, 80%, and 97%, respectively. Research on identifying texts in other minority languages, however, remains limited.

Features of minority websites identification

Present problems
Compared with Chinese and English characters, minority language characters have distinctive features. For historical reasons, some minority languages are written in several scripts. Computer technology for processing minority language characters lags behind. Half of the minority websites are folk (non-official) websites whose source code is not standard, and the encoding of some minority language characters is not unified, so different encodings of the same language are incompatible. The present problems of automatic identification therefore stem mainly from the features of the minority languages themselves and the immaturity of the supporting technology.

Features of minority websites

(1) The same language has different scripts, for example Mongolian (Traditional Mongolian, Tod Mongolian, New Mongolian), Uyghur (Arabic script, Latin script), Kazak (Arabic script, Latin script, Cyrillic script), and Tai Lue (New Tai Lue, Old Tai Lue).

(2) The same script has different encodings. Tibetan and Traditional Mongolian have the most encoding forms. Tibetan has Unicode, Founder, ASCII (11 forms), HuaGuang, Tibetan University, Tonguer, Pandita, and so on. Traditional Mongolian has Unicode, Menk, Hussein, Founder, Minggatu, Oyuta, Burigude, and so on.

(3) Different encodings of the same script have crossing and overlapping regions. In Tibetan, some of the GB2312-based encodings overlap; in Mongolian, some of the Unicode-based encodings overlap.

International Conference on Information Sciences, Machinery, Materials and Energy (ICISMME 2015). © 2015. The authors. Published by Atlantis Press.

Table 1. Part of GB2312-based Tibetan encoding

Encoding                    | First byte range       | Trail byte range       | Syllable point encoding
----------------------------|------------------------|------------------------|------------------------
Founder DOS                 | 0xC0-0xEE              | 0x21-0x7E              | 0xC032
Founder Windows             | 0xAA-0xAC, 0xB0-0xDE   | 0xA0-0xFE              | 0xAAAC
HuaGuang DOS                | 0xB0-0xFB              | 0x21-0x7E              | 0xE162
HuaGuang Windows            | 0xB0-0xFB              | 0xA1-0xFE              | 0xE1E2
Tonguer encoding            | 0x81-0xEE, 0xF5        | 0x21-0x7E, 0x40-0xFE   | 0xA6E6
Tibetan University encoding | 0xAA-0xAF, 0xF8-0xFB   | 0xA1-0xFE              | 0xFABB

Table 2. Part of Unicode-based Mongolian encoding

Encoding | Encoding scope
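The overlap among the GB2312-based Tibetan encodings in Table 1 can be illustrated with a minimal Python sketch. It is not the identification software described in this paper. The names `TIBETAN_GB2312_ENCODINGS`, `candidate_encodings`, and `syllable_point_votes` are hypothetical, the entries printed as "0xAA0xAC" and "0x210x7E" are assumed to be ranges, and counting the syllable point byte pair is only one plausible heuristic for separating encodings whose byte ranges overlap.

```python
# Inclusive byte ranges transcribed from Table 1.
# name -> (first-byte ranges, trail-byte ranges, syllable point code)
# Assumption: "0xAA0xAC" and "0x210x7E" in the source are ranges.
TIBETAN_GB2312_ENCODINGS = {
    "Founder DOS": ([(0xC0, 0xEE)], [(0x21, 0x7E)], 0xC032),
    "Founder Windows": ([(0xAA, 0xAC), (0xB0, 0xDE)], [(0xA0, 0xFE)], 0xAAAC),
    "HuaGuang DOS": ([(0xB0, 0xFB)], [(0x21, 0x7E)], 0xE162),
    "HuaGuang Windows": ([(0xB0, 0xFB)], [(0xA1, 0xFE)], 0xE1E2),
    "Tonguer": ([(0x81, 0xEE), (0xF5, 0xF5)],
                [(0x21, 0x7E), (0x40, 0xFE)], 0xA6E6),
    "Tibetan University": ([(0xAA, 0xAF), (0xF8, 0xFB)],
                           [(0xA1, 0xFE)], 0xFABB),
}


def in_ranges(value, ranges):
    """Return True if value falls inside any inclusive (low, high) range."""
    return any(low <= value <= high for low, high in ranges)


def candidate_encodings(first, trail):
    """List every encoding whose byte ranges admit the pair (first, trail)."""
    return [
        name
        for name, (first_ranges, trail_ranges, _) in TIBETAN_GB2312_ENCODINGS.items()
        if in_ranges(first, first_ranges) and in_ranges(trail, trail_ranges)
    ]


def syllable_point_votes(data):
    """Count occurrences of each encoding's syllable point byte pair.

    Simplification: bytes.count ignores two-byte character alignment,
    so this is a rough score rather than an exact character count.
    """
    return {
        name: data.count(code.to_bytes(2, "big"))
        for name, (_, _, code) in TIBETAN_GB2312_ENCODINGS.items()
    }


if __name__ == "__main__":
    # A single byte pair is ambiguous: three encodings accept 0xC0 0x32.
    print(candidate_encodings(0xC0, 0x32))
    # Counting the syllable point pair gives a per-encoding score.
    sample = bytes([0xC0, 0x32, 0xC1, 0x41, 0xC0, 0x32])
    print(syllable_point_votes(sample))
```

The byte pair 0xC0 0x32 is accepted by Founder DOS, HuaGuang DOS, and Tonguer alike, which is the kind of cross-field that makes range checks alone insufficient. A frequency-based signal such as the syllable point count is one way to choose among the remaining candidates.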