Abstract

Handling open data, like the vast repository of New York City (NYC) 311 service requests, often starts with the ubiquitous CSV (comma-separated values) file format. However, CSV files are notoriously inefficient for curation: every value is stored as text, which makes the files redundant, and no column types are recorded, which invites misinterpretation when the file is read. Enter Apache Arrow, a columnar format that not only reduces storage requirements but also prepares data for analysis across popular platforms like R, Python, and Julia. Using the NYC 311 service request data, we demonstrate the conversion of a CSV file to the Arrow IPC (Inter-Process Communication) format. An Arrow file stores the table schema together with the data in a binary format that can be memory-mapped for reading, enabling near-instantaneous access to potentially large datasets. The Arrow IPC file then serves as a universal starting point for analysis across various environments. In our example, the conversion is done in Julia, which has powerful packages for reading and writing CSV and Arrow files and for calling functions in other popular environments such as R and Python.
