Abstract

Plain text data consists of a sequence of encoded characters or “code points” from a given standard such as the Unicode Standard. Some of the most common file formats for digital data used in eScience (CSV, XML, and JSON, for example) are built atop plain text standards. Plain text representations of digital data are often preferred because plain text formats are relatively stable and facilitate reuse and interoperability. Despite its ubiquity, plain text is not as plain as it may seem. The standards used in modern text encoding (principally, the Unicode Character Set and the related encoding format, UTF-8) have complex architectures compared to historical standards like ASCII. Further, although the Unicode Standard has gained prominence, text encoding problems are not uncommon in research data curation. This primer provides conceptual foundations for modern text encoding and guidance for common curation and preservation actions related to textual data.

Highlights

  • Character encoding is an often-unconsidered aspect of day-to-day computing

  • Some of the most common file formats for digital data used in eScience (CSV, XML, and JSON, for example) are built atop plain text standards, meaning they are defined as streams of human-readable textual elements

  • Plain text is often conflated with American Standard Code for Information Interchange (ASCII) text, although ASCII is just one plain text standard. This assumption persists in part because Unicode was purposefully designed not to interfere with it: UTF-8 is backward compatible with ASCII
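
The backward compatibility noted in the last highlight can be illustrated with a minimal Python sketch. It is not taken from the primer itself: the helper name `describe` and the sample strings are hypothetical, chosen only to show how code points map to UTF-8 bytes. U+FB01 is the code point of LATIN SMALL LIGATURE FI.

```python
def describe(text: str) -> list[tuple[str, str, bytes]]:
    """Return (character, code point label, UTF-8 bytes) for each character.

    `describe` is a hypothetical helper written for this illustration.
    """
    return [(ch, f"U+{ord(ch):04X}", ch.encode("utf-8")) for ch in text]


# Text limited to the ASCII range produces identical bytes in ASCII and UTF-8.
ascii_text = "CSV"
assert ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Characters outside the ASCII range need more than one byte in UTF-8.
for ch, code_point, raw in describe("a\u00e9\ufb01"):
    print(ch, code_point, raw.hex(" "))
# a U+0061 61
# é U+00E9 c3 a9
# ﬁ U+FB01 ef ac 81
```

Because every ASCII character keeps its single-byte value in UTF-8, a file containing only ASCII characters is also valid UTF-8, which is one reason the two are so easily conflated.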


Summary

Seth Erickson Pennsylvania State University

Plain Text & Character Encoding: A Primer for Data Curators. Part of the Scholarly Communication Commons and the Scholarly Publishing Commons. Source: https://escholarship.umassmed.edu/jeslib

Introduction
What is Plain Text?
From ASCII to Unicode
Code Points and Abstract Characters
LATIN SMALL LIGATURE FI
Preferred Formats
Format Identification
Invalid Text Data
Misinterpreted Text Data
Mixed Format Data