A scalable neuroinformatics data flow for electrophysiological signals using MapReduce.

Catherine Jayapandian, Bilal Zonjy, Satya S Sahoo, Priya Ramesh, Annan Wei, Guo-Qiang Zhang, Samden D Lhatoo, Kenneth Loparo

doi:10.3389/fninf.2015.00004

Open Access

Abstract

Data-driven neuroscience research is providing new insights into the progression of neurological disorders and supporting the development of improved treatment approaches. However, the volume, velocity, and variety of neuroscience data generated from sophisticated recording instruments and acquisition methods have exacerbated the limited scalability of existing neuroinformatics tools. This makes it difficult for neuroscience researchers to effectively leverage the growing body of multi-modal neuroscience data to advance research in serious neurological disorders, such as epilepsy. We describe the development of the Cloudwave data flow, which uses new data partitioning techniques to store and analyze electrophysiological signal data in a distributed computing infrastructure. The Cloudwave data flow uses the MapReduce parallel programming algorithm to implement an integrated signal data processing pipeline that scales with large volumes of data generated at high velocity. Using an epilepsy domain ontology together with an epilepsy-focused extensible data representation format called Cloudwave Signal Format (CSF), the data flow addresses the challenge of data heterogeneity and is interoperable with existing neuroinformatics data representation formats, such as HDF5. The scalability of the Cloudwave data flow is evaluated using a 30-node cluster installed with the open source Hadoop software stack. The results demonstrate that the Cloudwave data flow can process increasing volumes of signal data by leveraging Hadoop Data Nodes to reduce the total data processing time. The Cloudwave data flow is a template for developing highly scalable neuroscience data processing pipelines using MapReduce algorithms to support a variety of user applications.
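The partition, map, and reduce pattern described in the abstract can be illustrated with a minimal, single-process Python sketch. This is not the Cloudwave implementation and does not use Hadoop or CSF. The fixed-length segmentation, the function names (`partition`, `map_segment`, `reduce_channel`, `run`), and the per-channel mean and RMS statistics are all assumptions chosen only to show how per-segment partial results can be combined per channel.

```python
from collections import defaultdict
from typing import Dict, Iterator, List, Tuple

# (channel name, segment index, samples); a hypothetical record layout, not CSF.
Segment = Tuple[str, int, List[float]]
Partial = Tuple[int, float, float]  # (sample count, sum, sum of squares)


def partition(signals: Dict[str, List[float]], segment_len: int) -> Iterator[Segment]:
    """Split each channel into fixed-length segments (assumed partitioning scheme)."""
    for channel, samples in signals.items():
        for start in range(0, len(samples), segment_len):
            yield channel, start // segment_len, samples[start:start + segment_len]


def map_segment(segment: Segment) -> Tuple[str, Partial]:
    """Map step: emit the channel as key and partial sums for one segment as value."""
    channel, _, samples = segment
    return channel, (len(samples), sum(samples), sum(x * x for x in samples))


def reduce_channel(values: List[Partial]) -> Dict[str, float]:
    """Reduce step: merge the partial sums of one channel into mean and RMS."""
    n = sum(v[0] for v in values)
    total = sum(v[1] for v in values)
    squares = sum(v[2] for v in values)
    return {"n": n, "mean": total / n, "rms": (squares / n) ** 0.5}


def run(signals: Dict[str, List[float]], segment_len: int) -> Dict[str, Dict[str, float]]:
    """Run partition -> map -> group by key -> reduce in a single process."""
    grouped: Dict[str, List[Partial]] = defaultdict(list)
    for segment in partition(signals, segment_len):
        key, value = map_segment(segment)
        grouped[key].append(value)  # stands in for the shuffle phase
    return {channel: reduce_channel(values) for channel, values in grouped.items()}


if __name__ == "__main__":
    toy = {"C3": [1.0, -1.0, 1.0, -1.0, 1.0], "ECG": [2.0, 2.0]}
    print(run(toy, segment_len=2))
```

Because each segment is mapped independently and the reduce step only merges sums, the segments could in principle be processed on different nodes, which is the property a MapReduce framework exploits.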

Highlights

Electrophysiological signal data, such as electroencephalogram (EEG) and electrocardiogram (ECG), are critical to both neuroscience research and patient care (Bartolomei et al., 2008; Wendling et al., 2010)
The data flow was executed on a High Performance Compute Cluster (HPCC) at Case Western Reserve University (CWRU) using the open source Hadoop software
The HPCC consists of 30 data nodes and a master node connected by 10 Gigabit Ethernet (GigE)

Summary

Introduction

Electrophysiological signal data, such as electroencephalogram (EEG) and electrocardiogram (ECG), are critical to both neuroscience research and patient care (Bartolomei et al., 2008; Wendling et al., 2010). EEG signal data play a key role in the treatment of neurological disease; for example, they are used as the gold standard for identifying the seizure onset zone in focal epilepsy patients during presurgical evaluation (Rosenow and Lüders, 2001). The growing sophistication of signal recording hardware and signal analysis techniques, such as the development of an epileptogenicity index that uses stereotactic EEG and MRI data to characterize the seizure onset zone (Bartolomei et al., 2008), has significantly increased the data management challenges for signal data.

Objectives

Methods

Results

Discussion

Conclusion