Abstract

Background: Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized across archives, databases and sometimes isolated hard disks that are inaccessible for browsing and analysis. It is expected that curated secondary databases will help organize some of this Big Data, thereby allowing users to better navigate, search and compute on it.

Results: To address the above challenge, we have implemented an NGS biocuration workflow and are analyzing short-read sequences and associated metadata from cancer patients to better understand the human variome. Curation of variation and other related information from control (normal tissue) and case (tumor) samples will provide comprehensive background information that can be used in genomic medicine research and application studies. Our approach includes a CloudBioLinux virtual machine used upstream of an integrated High-performance Integrated Virtual Environment (HIVE) that encapsulates the Curated Short Read archive (CSR) and a proteome-wide variation effect analysis tool (SNVDis). As a proof of concept, we have curated and analyzed control and case breast cancer datasets from the NCI cancer genomics program, The Cancer Genome Atlas (TCGA). Our efforts include reviewing and recording in CSR the available clinical information on patients, mapping the reads to the reference genome, identifying non-synonymous single nucleotide variations (nsSNVs), and integrating the data with tools that allow analysis of the effect of nsSNVs on the human proteome. Furthermore, we have developed a novel phylogenetic analysis algorithm that uses SNV positions and can be used to classify the patient population. The workflow described here lays the foundation for analysis of short-read sequence data to identify rare and novel SNVs that are not present in dbSNP and therefore provides a more comprehensive understanding of the human variome. Variation results for single genes as well as the entire study are available from the CSR website (http://hive.biochemistry.gwu.edu/dna.cgi?cmd=csr).

Conclusions: The availability of thousands of sequenced samples from patients provides a rich repository of sequence information that can be utilized to identify individual-level SNVs and their effect on the human proteome beyond what the dbSNP database provides.
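The novel-SNV identification step described above can be illustrated with a short sketch. The Python example below is not the HIVE/CSR implementation; it assumes variants have already been called into a standard VCF file and simply filters out records carrying a known dbSNP rsID, leaving candidate novel SNVs. The file names (called_variants.vcf, dbsnp_rsids.txt) are hypothetical placeholders.

    # Minimal sketch: flag SNVs absent from dbSNP as candidate novel variants.
    # Assumes variants were already called into a VCF; file names and the
    # rsID list are illustrative placeholders, not part of the CSR pipeline.

    def load_known_rsids(path):
        """Read one dbSNP rsID per line (e.g. 'rs12345') into a set."""
        with open(path) as fh:
            return {line.strip() for line in fh if line.strip()}

    def novel_snvs(vcf_path, known_rsids):
        """Yield (chrom, pos, ref, alt) for SNVs not annotated in dbSNP."""
        with open(vcf_path) as vcf:
            for line in vcf:
                if line.startswith("#"):              # skip VCF header lines
                    continue
                chrom, pos, vid, ref, alt = line.split("\t")[:5]
                if len(ref) == 1 and len(alt) == 1:   # keep SNVs only
                    # '.' means the caller assigned no dbSNP ID
                    if vid == "." or vid not in known_rsids:
                        yield chrom, int(pos), ref, alt

    if __name__ == "__main__":
        known = load_known_rsids("dbsnp_rsids.txt")   # hypothetical input
        for record in novel_snvs("called_variants.vcf", known):
            print(*record)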

Highlights

  • Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized in archives, databases and sometimes in isolated hard-disks which are inaccessible for browsing and analysis

  • International and national networks of collaboration, such as the International Cancer Genome Consortium (ICGC) [4], the Global Cancer Genomics Consortium (GCGC) [5], the Early Detection Research Network (EDRN) [6] and other NCI programs (such as The Cancer Genome Atlas (TCGA) [7]), are generating increasingly large amounts of data, the vast majority of which comes from next-generation sequencing (NGS) technologies

  • Through focused analysis of breast cancer samples, we show how a workflow that identifies novel variations, explores the effects of non-synonymous single-nucleotide variations (nsSNVs) on the human proteome and classifies patients based on single nucleotide variations (SNVs) can provide a higher level of information that researchers can use to evaluate experimental targets and to generate and test hypotheses related to personalized medicine (a minimal illustrative sketch of such SNV-based classification follows this list)
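The SNV-based patient classification highlighted above can be sketched as follows. This is not the paper's phylogenetic algorithm; it substitutes a conventional stand-in: each patient is encoded as a binary vector over the union of observed SNV positions, pairwise Hamming distances are computed, and patients are grouped by average-linkage hierarchical clustering. Patient IDs and positions are invented for illustration.

    # Illustrative sketch of classifying patients from SNV positions.
    # NOT the paper's phylogenetic algorithm: a standard substitute using
    # binary presence/absence vectors, Hamming distance, and
    # average-linkage hierarchical clustering (SciPy).
    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster
    from scipy.spatial.distance import pdist

    # Hypothetical input: patient -> set of SNV positions ("chrom:pos").
    patient_snvs = {
        "TCGA-01": {"17:7579472", "13:32906729"},
        "TCGA-02": {"17:7579472", "3:178936091"},
        "TCGA-03": {"3:178936091", "13:32906729"},
    }

    patients = sorted(patient_snvs)
    positions = sorted(set().union(*patient_snvs.values()))

    # Binary matrix: one row per patient, one column per SNV position.
    matrix = np.array([[pos in patient_snvs[p] for pos in positions]
                       for p in patients], dtype=bool)

    # Pairwise Hamming distances, then average-linkage clustering.
    tree = linkage(pdist(matrix, metric="hamming"), method="average")
    labels = fcluster(tree, t=2, criterion="maxclust")  # cut into 2 groups

    for patient, label in zip(patients, labels):
        print(patient, "-> cluster", label)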



Introduction

Next-generation sequencing (NGS) technologies have resulted in petabytes of scattered data, decentralized across archives, databases and sometimes isolated hard disks that are inaccessible for browsing and analysis. While it is desirable that cancer biologists use these data to develop and test hypotheses, realistically few wet-laboratory researchers have the infrastructure or the knowledge of scores of complex bioinformatics tools needed to glean a higher understanding from the disparate sequence files and complex, scattered annotations. These challenges are driving the development of tools and secondary databases that are expected to democratize the use of Big Data [8,9,10], and initiatives such as the Human Variome Project have started playing an important role by providing guidelines that encourage the standardization and sharing of information related to human genetic variation [11,12,13]

