Computer sequencing and non-alphabetic interference in language data processing

Demetrius J Koubourlis

doi:10.1007/bf02403855

Abstract

Whether a list of items to be sorted is a general or special-purpose dictionary, a thesaurus, a concordance, an index, a glossary, a telephone directory, or a book list, be it in English or any other language, the rules for correct alphabetization are synonymous with knowledge of the established alphabetic order for a given language. Additional, but not uniformly observed, guideposts include familiarity with specialized conventions concerning elision and hyphenation, abbreviations and numerals, initial articles, certain proper name prefixes, etc., and adherence to word-by-word or letter-by-letter alphabetization (Collison 1959:84-6 and 176-7, Dulka and Nitecki 1970, Fisk 1968, Harris and Hines 1970, Hines 1967, Hines and Harris 1966:23-38, Holmstrom 1959, Johnson 1957, Knight 1969, Metcalfe 1967, Popecki 1965, American Standards Association Z39 Committee 1958). In automated alphabetization, however, additional constraints are introduced, and it becomes necessary to make explicit not only what to do but also what not to do, especially as the alphabet becomes a subset of a character set, which in turn is a subset of a collating sequence (e.g., EBCDIC) containing, apart from bit configurations corresponding to the character set, configurations which, although printable in bit form, have not been assigned single-character representation. A collating sequence is an ordered set whose members stand in one-to-one correspondence with the members of a subset of the set of integers. This simply means that each member of a character set has been assigned an ordinal value, and that value is precisely what determines sorting order. The problem, unrecognized by many people in computing, arises in that a character set, besides alphabetic characters (e.g., for English, the letters A-Z) and Arabic and Roman numerals, may and normally does include a host of special-purpose symbols such as logical, mathematical, commercial, and, last but not least, punctuation symbols.
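The dependence of sorting order on ordinal values can be illustrated with a minimal Python sketch. The sample entries and the helper name `ordinal_key` are hypothetical and not taken from the article, and Python's `cp037` codec is used here only as one representative EBCDIC code page.

```python
# Sketch: sort the same entries by raw ordinal value under two collating
# sequences. The entries and the helper name are illustrative assumptions.

def ordinal_key(entry: str, encoding: str) -> bytes:
    """Return the entry as the byte values assigned by the given encoding."""
    return entry.encode(encoding)

entries = ["re-cover", "recover", "re cover", "Recover", "record"]

# ASCII: space (0x20) < hyphen (0x2D) < uppercase letters < lowercase letters.
ascii_order = sorted(entries, key=lambda e: ordinal_key(e, "ascii"))

# EBCDIC (cp037): space (0x40) < hyphen (0x60) < lowercase < uppercase.
ebcdic_order = sorted(entries, key=lambda e: ordinal_key(e, "cp037"))

print(ascii_order)
# ['Recover', 're cover', 're-cover', 'record', 'recover']
print(ebcdic_order)
# ['re cover', 're-cover', 'record', 'recover', 'Recover']
```

In both sequences the space and the hyphen receive ordinal values lower than any letter, so "re cover" and "re-cover" precede "record", while the position of "Recover" depends entirely on where each sequence places uppercase letters.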
As a consequence, an attempt to alphabetize material containing alphabetic and non-alphabetic characters according to a collating sequence yields unpredictable results. It is possible, of course, to design special-purpose collating sequences consisting of alphabetic characters exclusively. Since special characters will not be members of such a sequence, an attempt to sort entries minimally different in terms of special characters, especially in the presence of
