Plain Text & Character Encoding: A Primer for Data Curators

Seth Erickson

doi:10.7191/jeslib.2021.1211

Abstract

Plain text data consists of a sequence of encoded characters or “code points” from a given standard such as the Unicode Standard. Some of the most common file formats for digital data used in eScience (CSV, XML, and JSON, for example) are built atop plain text standards. Plain text representations of digital data are often preferred because plain text formats are relatively stable, and they facilitate reuse and interoperability. Despite its ubiquity, plain text is not as plain as it may seem. The standards used in modern text encoding (principally, the Unicode Character Set and the related encoding format, UTF-8) have complex architectures when compared to historical standards like ASCII. Further, while the Unicode Standard has gained prominence, text encoding problems are not uncommon in research data curation. This primer provides conceptual foundations for modern text encoding and guidance for common curation and preservation actions related to textual data.
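The distinction between a code point and its encoded bytes can be illustrated with a short Python sketch. It is not taken from the primer itself: the helper name `describe` and the sample string are assumptions chosen for illustration, and the sample includes U+FB01 (LATIN SMALL LIGATURE FI), which appears in the section outline.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str, str]]:
    """Return (code point, Unicode name, UTF-8 bytes as hex) for each character.

    Hypothetical helper for illustration only.
    """
    rows = []
    for ch in text:
        code_point = f"U+{ord(ch):04X}"            # abstract number assigned by Unicode
        name = unicodedata.name(ch, "<unnamed>")   # official character name, if any
        utf8_hex = ch.encode("utf-8").hex(" ")     # concrete bytes under UTF-8
        rows.append((code_point, name, utf8_hex))
    return rows


# Sample: an ASCII letter, an accented letter, and the "fi" ligature.
rows = describe("a\u00e9\ufb01")
for row in rows:
    print(*row, sep="  ")
```

Each character has one code point, but the number of bytes used to store it under UTF-8 varies: one byte for the ASCII letter, two for the accented letter, and three for the ligature.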

Highlights

Character encoding is an often-unconsidered aspect of day-to-day computing.
Some of the most common file formats for digital data used in eScience (CSV, XML, and JSON, for example) are built atop plain text standards, meaning they are defined as streams of human-readable textual elements.
Plain text is often conflated with American Standard Code for Information Interchange (ASCII) text, although ASCII is just one plain text standard. This assumption persists in part because Unicode was purposefully designed not to interfere with it: UTF-8 is backward compatible with ASCII.
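The backward compatibility noted in the last highlight can be checked directly. The following is a minimal Python sketch, not code from the primer; the sample CSV fragments and the helper `is_ascii_only` are assumptions made for illustration.

```python
def is_ascii_only(data: bytes) -> bool:
    """True when every byte is below 0x80, i.e. within the 7-bit ASCII range."""
    return all(byte < 0x80 for byte in data)


# A pure-ASCII CSV fragment decodes to the same text under ASCII and UTF-8,
# because UTF-8 encodes U+0000..U+007F as the identical single bytes.
ascii_bytes = "id,name\n1,Ada\n".encode("ascii")
same_text = ascii_bytes.decode("utf-8") == ascii_bytes.decode("ascii")

# One non-ASCII character breaks the "plain text is ASCII" assumption:
# the UTF-8 bytes for U+00EB fall outside the ASCII range.
utf8_bytes = "1,Zo\u00eb\n".encode("utf-8")
try:
    utf8_bytes.decode("ascii")
    decodes_as_ascii = True
except UnicodeDecodeError:
    decodes_as_ascii = False

print(same_text, is_ascii_only(ascii_bytes), is_ascii_only(utf8_bytes), decodes_as_ascii)
```

A file that happens to contain only ASCII characters is therefore valid under both standards, which is one reason the conflation can go unnoticed until a non-ASCII character appears.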

Summary

Seth Erickson Pennsylvania State University


Introduction

What is Plain Text?

From ASCII to Unicode

Code Points and Abstract Characters

LATIN SMALL LIGATURE FI

Preferred Formats

Format Identification

Invalid Text Data

Misinterpreted Text Data

Mixed Format Data


Journal: Journal of eScience Librarianship
Publication Date: Aug 11, 2021
License type: CC BY

Similar Papers

A Cloud-User Protocol Based on Ciphertext Watermarking Technology
Keyang Liu ... Xiaojuan Dong
Security and Communication Networks | VOL. 2017
01 Jan 2017

A Mass Spectrometry Proteomics Data Management Platform
Vagisha Sharma ... Michael Riffle
Molecular & cellular proteomics : MCP | VOL. 11
01 Sep 2012

Korpus Arab Pesantren: Digitizing the work of Arabic non-Arabic speakers at Modern Islamic Institution Darussalam Gontor
Yoke Suryadarma ... Gamal Abdul Nasir Zakaria
At-Ta'dib | VOL. 17
01 Jun 2022

BrainLiner: A Platform for Neurophysiological Data Sharing and Manipulation
Kamitani Yukiyasu
Frontiers in Neuroinformatics | VOL. 5
01 Jan 2010
