Chemical and linguistic considerations for encoding Chinese characters: an embodiment using chain-end degradable sequence-defined oligourethanes created by consecutive solid phase click chemistry.

Eric V Anslyn,Le Zhang,Todd B Krause,Hyun Meen Park,Danny Law,Harnimarta Deol,Bipin Pandey,Qifan Xiao,Brent L Iverson

doi:10.1039/d3sc06189b

Abstract

Sequence-defined polymers (SDPs) are currently being investigated for use as information storage media. As the number of monomers in the SDPs increases, with a corresponding increase in mathematical base, the use of tandem-MS for de novo sequencing becomes more challenging. In contrast, chain-end degradation routines are truly de novo, potentially allowing very large mathematical bases for encoding. While alphabetic scripts have a few dozen symbols, logographic scripts, such as Chinese, can have several thousand symbols. Using a new in situ consecutive click reaction approach on an oligourethane backbone for writing, and a previously reported chain-end degradation routine for reading, we encoded/decoded a confucius proverb written in Chinese characters using two encoding schemes: Unicode and Zhèng Mă. Unicode is an internationally standardized arbitrary string of hexadecimal (base-16) symbols which efficiently encodes uniquely identifiable symbols but requires complete fidelity of transmission, or context-based inferential strategies to be interpreted. The Zhèng Mă approach encodes with a base-26 system using the visual characteristics and internal composition of Chinese characters themselves, which leads to greater ambiguity of encoded strings, but more robust retrievability of information from partial or corrupted encodings. The application of information-encoded oligourethanes to two different encoding systems allowed us to establish their flexibility and versatility for data storage. We found the oligourethanes immensely adaptable to both encoding schemes for Chinese characters, and we highlight the expected tradeoff between the efficiency and uniqueness of Unicode encoding on the one hand, and the fidelity to a scripts' particular visual characteristics on the other.

Full Text