Abstract

ISO 10646 Universal Character Set (UCS), or Unicode, covers the symbols of most of the world's written languages. Several UCS transformation formats (UTF) exist. UTF-8 is compatible with systems that assume 8-bit characters, but it is not space-efficient: for files consisting mostly of Asian characters such as Han ideographs, file size increases by about 50% under UTF-8. Although the Standard Compression Scheme for Unicode (SCSU) can compress Unicode strings to the size of a locale-specific character set, it is complicated and is not intended to serve as a general-purpose interchange format. This paper proposes a page-shift transformation format of ISO 10646, called UTF-S. It has four pages: 1-byte, 2-byte, 3-byte, and 4-byte. A shift to page 0 uses a special code; shifts to pages 1, 2, and 3 use the ISO 2022 shift codes SO, SS2, and SS3, respectively. We test several text files and compare these UTFs with Big5, a locale-specific character set. The results show that the space efficiency of UTF-S is better than that of UTF-16 and UTF-8 and close to that of SCSU. UTF-S is suitable for replacing locale-specific character sets with ISO 10646 in Internet applications such as the World Wide Web. Copyright © 2001 John Wiley & Sons, Ltd.
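
The roughly 50% size increase mentioned above can be checked with a minimal sketch. It assumes the baseline is a 2-byte locale-specific encoding such as Big5. The sample string and the helper name `encoded_sizes` are illustrative and do not come from the paper. UTF-S and SCSU are left out because Python's standard library has no codec for either.

```python
def encoded_sizes(text):
    """Return the byte length of `text` under several encodings."""
    return {
        "big5": len(text.encode("big5")),          # locale-specific, 2 bytes per ideograph
        "utf-16": len(text.encode("utf-16-be")),   # big-endian, no byte-order mark
        "utf-8": len(text.encode("utf-8")),        # 3 bytes per BMP Han ideograph
    }


# Six Han ideographs, chosen only for illustration (hypothetical sample).
sample = "中文字元編碼"
sizes = encoded_sizes(sample)

# Relative growth of UTF-8 over the Big5 baseline.
growth = (sizes["utf-8"] - sizes["big5"]) / sizes["big5"]
print(sizes, f"UTF-8 growth over Big5: {growth:.0%}")
```

For text made up entirely of such ideographs, UTF-8 needs three bytes where Big5 and UTF-16 need two, which is where the 50% figure comes from. Mixed text containing ASCII would show a smaller increase.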
