Abstract

Text processing is an important computer application, and a number of text manipulation programming languages (e.g. Icon) have been devised to support it. These languages are useful for applications such as natural language processing, text analysis, text editing, document formatting and text generation. However, they were designed mainly to handle English text and are ineffective for Chinese, because the two are represented very differently in a computer: an English character is usually represented in 7-bit ASCII, whereas a Chinese character is commonly represented in 16-bit GB or BIG-5. As a result, applying an English-based text manipulation language directly to Chinese text produces errors, e.g. when Icon is used to reverse a string of Chinese characters. In this paper, a new dialect of Icon, referred to as Chicon (i.e. Chinese Icon), is proposed. In the design of Chicon, new data types were introduced to differentiate pure English texts from mixed English/Chinese texts, and the existing Icon text manipulation functions were modified to account for Chinese texts. Experiments have shown that Chicon not only overcomes the problems of Chinese processing in Icon, but also executes faster than Icon when handling Chinese. Furthermore, application of Chicon to a realistically sized problem, namely word segmentation, has shown that the language is practical. © 1998 John Wiley & Sons, Ltd.
