Practical guide for managing large-scale human genome data in research

Tomoya Tanjo, Katsushi Tokunaga, Yosuke Kawai, Osamu Ogasawara, Masao Nagasaki

doi:10.1038/s10038-020-00862-1

Abstract

Studies in human genetics deal with a plethora of human genome sequencing data, both generated from specimens and available in the public domain. With the development of various bioinformatics applications, it is essential to maintain research productivity while managing human genome data and performing downstream analyses. This review aims to guide researchers who struggle to process and analyze these large-scale genomic data and to extract relevant information for improved downstream analyses. Here, we discuss worldwide human genome projects whose data could be integrated with any dataset for improved analysis. Handling human whole-genome sequencing data is costly in terms of both data storage and processing; therefore, we focus on the development of data formats and software that manipulate whole-genome sequencing data. Once the sequencing is complete and its format and data processing tools are selected, a computational platform is required. For the platform, we describe a multi-cloud strategy that balances cost, performance, and customizability. High-quality published research relies on data reproducibility to ensure quality results, reusability for application to other datasets, and scalability for the future growth of datasets. To address these needs, we describe several key technologies developed in computer science, including workflow engines. We also discuss the ethical guidelines that are indispensable for human genomic data analysis and that differ from those for model organisms. Finally, an ideal future perspective on data processing and analysis is summarized.

Highlights

In human genetics, advancements in next-generation sequencing technology have enabled population-scale sequencing from just one sequencer and allowed millions of human genome sequencing data to be shared from public archives, including privacy-protected ones.
This review aims to guide researchers in human genetics to process and analyze these large-scale genomic data and to extract relevant information for improved downstream analyses in their specific research domains.
In the “How to store and analyze human genome data efficiently?” section, we focus on the development of data formats and software that manipulate whole-genome sequencing (WGS) data, including hardware-based acceleration.

Summary

Introduction

Advancements in next-generation sequencing technology have enabled population-scale sequencing from just one sequencer and allowed millions of human genome sequencing data to be shared from public archives, including privacy-protected ones. The genomic data are widely distributed under the open access policy through various computational platforms, e.g., the high-performance computing (HPC) system of the National Institute of Genetics (NIG) in Japan and public cloud services. These efforts make the data easier for researchers to reuse. The combination of a workflow description language and workflow engines provides portability across different computational environments and scalability of data analysis as computational resources grow. On July 16, 2020, the Court of Justice of the European Union issued a judgment declaring “invalid” the decision on the adequacy of the protection provided by the EU-U.S. Privacy Shield (https://www.privacyshield.gov/Program-Overview).

Conclusion and future direction

Compliance with ethical standards

Journal: Journal of Human Genetics	Publication Date: Oct 23, 2020
Citations: 41	License type: open-access
