From CSV to Arrow: Creating a Unified Data Set for Efficient Cross-Platform Analysis

Douglas Bates, Jun Yan

doi:10.1080/09332480.2024.2434443

Abstract

Handling open data, like the vast repository of New York City (NYC) 311 service requests, often starts with the ubiquitous CSV (comma-separated values) file format. However, CSV files are notoriously inefficient for curation, bogged down by redundancy and potential misinterpretations. Enter Apache Arrow, a game-changing approach that not only slashes storage requirements but also primes data for seamless analysis across popular platforms like R, Python, and Julia. Using the NYC 311 service request data, we demonstrate the conversion of a CSV file to the Arrow IPC (Inter-Process Communication) format. An Arrow file stores the table schema with the data in a binary format that can be memory-mapped for reading, enabling near-instantaneous access to potentially large datasets. The Arrow IPC data serves as a universal starting point for analysis across various environments. In our example, this conversion is done in Julia, which has powerful packages for reading and writing CSV or Arrow files and for calling functions in other popular environments such as R and Python.

Full Text