A user-friendly tool to transform large scale administrative data into wide table format using a mapreduce program with a pig latin based script

Hiromasa Horiguchi,Hideo Yasunaga,Kazuhiko Ohe,Hideki Hashimoto

doi:10.1186/1472-6947-12-151

Abstract

Background: Secondary use of large scale administrative data is increasingly popular in health services and clinical research, where a user-friendly tool for data management is in great demand. MapReduce technology such as Hadoop is a promising tool for this purpose, though its use has been limited by the lack of user-friendly functions for transforming large scale data into wide table format, where each subject is represented by one row, for use in health services and clinical research. Since the original specification of Pig provides very few functions for column field management, we have developed a novel system called GroupFilterFormat to handle the definition of field and data content based on a Pig Latin script. We have also developed, as an open-source project, several user-defined functions to transform the table format using GroupFilterFormat and to deal with processing that considers date conditions.

Results: Having prepared dummy discharge summary data for 2.3 million inpatients and medical activity log data for 950 million events, we used the Elastic Compute Cloud environment provided by Amazon Inc. to execute processing speed and scaling benchmarks. In the speed benchmark test, the response time was significantly reduced and a linear relationship was observed between the quantity of data and processing time in both a small and a very large dataset. The scaling benchmark test showed clear scalability. In our system, doubling the number of nodes resulted in a 47% decrease in processing time.

Conclusions: Our newly developed system is widely accessible as an open resource. This system is very simple and easy to use for researchers who are accustomed to using declarative command syntax for commercial statistical software and Structured Query Language. Although our system needs further sophistication to allow more flexibility in scripts and to improve efficiency in data processing, it shows promise in facilitating the application of MapReduce technology to efficient data processing with large scale administrative data in health services and clinical research.
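To put the reported scaling figure in context, the short Python sketch below converts a fractional reduction in processing time into a speedup factor and compares it with ideal linear scaling. The function names are illustrative assumptions and are not part of the authors' tool; only the 47% figure comes from the abstract.

```python
def speedup(time_reduction: float) -> float:
    """Speedup factor implied by a fractional reduction in run time."""
    return 1.0 / (1.0 - time_reduction)


def scaling_efficiency(time_reduction: float, node_factor: float = 2.0) -> float:
    """Fraction of ideal linear speedup achieved when nodes grow by node_factor."""
    return speedup(time_reduction) / node_factor


# Figure reported in the abstract: 47% less time after doubling the nodes.
reported_reduction = 0.47

print(f"speedup: {speedup(reported_reduction):.2f}x")
print(f"efficiency vs. ideal doubling: {scaling_efficiency(reported_reduction):.0%}")
```

Under this reading, a 47% decrease corresponds to roughly a 1.9-fold speedup, which is close to the 2-fold speedup that perfectly linear scaling would give.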

Highlights

Secondary use of large scale administrative data is increasingly popular in health services and clinical research, where a user-friendly tool for data management is in great demand
Given the background and incentives above, we have developed several user-defined functions (UDFs) to process large scale administrative data for ease of epidemiological analysis, based on a Pig Latin script in the Hadoop framework [17]
Since the original Pig has very limited functions for column field management, we newly developed GroupFilterFormat to handle the definition of field and data content

Summary

Introduction

Secondary use of large scale administrative data is increasingly popular in health services and clinical research, where a user-friendly tool for data management is in great demand. Secondary large scale data such as nation-wide administrative data are increasingly utilized in clinical and health service research for timely outcomes studies in real world settings [1,2,3,4,5]. This trend has further been fueled by recent improvements in informatics technology for handling ultra large volumes of on-site data through work parallelization and cloud computing. Extracting data from different sources requires linkage of data with multiple unique patient identifiers, and complicated steps for data merge and transformation (Figure 1). It is often necessary in epidemiological studies to calculate the time interval between different events recorded in different rows, and to transform these data into a wide table column. For example, the time interval between the first and last dates of antibiotic administration would need to be calculated and queried.
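The kind of transformation described above can be sketched in plain Python as a map step followed by a reduce step. This is a minimal illustration only: it is not the authors' GroupFilterFormat or their Pig Latin script, and the record layout, patient identifiers, event names, and output field names are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical long-format activity log: one row per (patient, date, event).
events = [
    ("P001", date(2010, 4, 1), "antibiotic"),
    ("P001", date(2010, 4, 3), "xray"),
    ("P001", date(2010, 4, 5), "antibiotic"),
    ("P002", date(2010, 5, 10), "antibiotic"),
]


def map_phase(rows, event_type):
    """Emit (patient_id, event_date) pairs for the requested event type."""
    for patient_id, event_date, kind in rows:
        if kind == event_type:
            yield patient_id, event_date


def reduce_phase(pairs):
    """Collapse each patient's dates into a single wide-format row."""
    grouped = defaultdict(list)
    for patient_id, event_date in pairs:
        grouped[patient_id].append(event_date)

    wide = {}
    for patient_id, dates in grouped.items():
        first, last = min(dates), max(dates)
        wide[patient_id] = {
            "first_date": first,
            "last_date": last,
            "interval_days": (last - first).days,
        }
    return wide


# One row per patient, with the interval between first and last antibiotic dates.
wide_table = reduce_phase(map_phase(events, "antibiotic"))
for patient_id, row in sorted(wide_table.items()):
    print(patient_id, row["first_date"], row["last_date"], row["interval_days"])
```

Grouping by patient identifier before computing the minimum and maximum dates is what lets events stored in different rows end up as columns of one row per subject. In a real MapReduce job the grouping would be done by the framework's shuffle rather than by an in-memory dictionary.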

Journal: BMC Medical Informatics and Decision Making	Publication Date: Dec 1, 2012
Citations: 19	License type: CC BY 2.0
