A page-shift transformation format of ISO 10646

Pei‐Chi Wu

doi:10.1002/spe.427

Abstract

The ISO 10646 Universal Character Set (UCS), or Unicode, covers symbols in most of the world's written languages. There are various UCS transformation formats (UTF). UTF-8 is compatible with systems that assume 8-bit characters. One of the problems with UTF-8 is its space efficiency. For files consisting mostly of Asian characters such as Han ideographs, file sizes increase by about 50% when UTF-8 is used. Although the Standard Compression Scheme for Unicode (SCSU) can compress Unicode strings to the size of a locale-specific character set, it is complicated and is not intended to serve as a general-purpose interchange format. This paper proposes a page-shift transformation format of ISO 10646, called UTF-S. There are four pages: 1-byte, 2-byte, 3-byte and 4-byte. A shift to page 0 uses a special code; shifts to pages 1, 2 and 3 use the ISO 2022 shift codes SO, SS2 and SS3, respectively. We test several text files and compare these UTFs with Big5, a locale-specific character set. The results show that the space efficiency of UTF-S is better than that of UTF-16 and UTF-8 and is close to that of SCSU. UTF-S is suitable for replacing locale-specific character sets with ISO 10646 in Internet applications, such as the World Wide Web. Copyright © 2001 John Wiley & Sons, Ltd.
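
The space overhead of UTF-8 for Han text mentioned above can be checked with a short sketch. It is not taken from the paper: the sample string and the helper name `encoded_sizes` are assumptions chosen for illustration, and only Python's standard codecs are used. UTF-S itself is not implemented here, since the abstract does not give its byte layout.

```python
def encoded_sizes(text):
    """Return the byte length of `text` under several encodings."""
    # "utf-16-le" is used so that no byte-order mark is counted.
    codecs = ["big5", "utf-8", "utf-16-le"]
    return {name: len(text.encode(name)) for name in codecs}


# Hypothetical sample: four common Han ideographs, all encodable in Big5.
sample = "中文字元"
sizes = encoded_sizes(sample)

# Relative growth of UTF-8 over the locale-specific Big5 encoding.
growth = sizes["utf-8"] / sizes["big5"] - 1
print(sizes, f"UTF-8 growth over Big5: {growth:.0%}")
```

Each ideograph in this sample takes 2 bytes in Big5 and UTF-16 but 3 bytes in UTF-8, which is where the roughly 50% figure comes from for text made up almost entirely of such characters. Real files that mix in ASCII will show a smaller difference.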
