Machine learning and data mining in complex genomic data--a review on the lessons learned in Genetic Analysis Workshop 19.

Inke R König, Nathan Tintle, Damian Gola, Elizabeth Held, Hsin-Chou Yang, Emily R Holzinger, Marc-André Legault, Rui Sun, Jonathan Auerbach

doi:10.1186/s12863-015-0315-8

Abstract

In the analysis of current genomic data, the application of machine learning and data mining techniques has become more attractive given the rising complexity of the projects. As part of the Genetic Analysis Workshop 19, approaches from this domain were explored, mostly motivated by two starting points. First, assuming an underlying structure in the genomic data, data mining might identify this structure and thus improve downstream association analyses. Second, computational methods for machine learning need to be developed further to deal efficiently with the current wealth of data. In the course of discussing results and experiences from the machine learning and data mining approaches, six common messages were extracted. These depict the current state of these approaches in their application to complex genomic data. Although some challenges remain for future studies, important steps forward were taken in the integration of different data types and the evaluation of the evidence. Mining the data for underlying genetic or phenotypic structure and using this information in subsequent analyses proved to be extremely helpful and is likely to become even more useful with more complex data sets.

Highlights

The analysis of complex genomic data is a challenging endeavor that may be tackled using machine learning and data mining techniques. What these methods have in common is that they search through data to look for patterns.
Yang and Lin [6] used the real whole-genome sequencing data, with 2,769,837 common single nucleotide polymorphisms (SNPs) and 5,578,826 rare variants on the odd-numbered autosomes of 959 related individuals from 20 large pedigrees, to estimate the homozygosity intensity for every individual.
While Auerbach et al. [8] and Yang and Lin [6] mine the genetic data for underlying structure, Sun et al. [11] exploit additional multiphenotype information to empower subsequent association analyses with rare variants.

Summary

Introduction

The analysis of complex genomic data is a challenging endeavor that may be tackled using machine learning and data mining techniques. Yang and Lin [6] used the real whole-genome sequencing data, with 2,769,837 common single nucleotide polymorphisms (SNPs) and 5,578,826 rare variants on the odd-numbered autosomes of 959 related individuals from 20 large pedigrees, to estimate the homozygosity intensity for every individual.
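
To make the idea of a per-individual homozygosity summary concrete, the following is a minimal sketch and not the estimator used by Yang and Lin [6]. It computes a much simpler proxy, the proportion of called genotypes that are homozygous for each individual. The genotype coding (0, 1, 2 copies of the minor allele, `None` for missing), the function names, and the toy data are all assumptions made for illustration.

```python
from typing import Dict, List, Optional

# Assumed coding: 0, 1, 2 = copies of the minor allele; None = missing call.
Genotype = Optional[int]


def homozygosity_proportion(genotypes: List[Genotype]) -> Optional[float]:
    """Fraction of non-missing genotypes that are homozygous (coded 0 or 2).

    Returns None when the individual has no called genotypes.
    """
    called = [g for g in genotypes if g is not None]
    if not called:
        return None
    homozygous = sum(1 for g in called if g in (0, 2))
    return homozygous / len(called)


def per_individual_homozygosity(
    data: Dict[str, List[Genotype]]
) -> Dict[str, Optional[float]]:
    """Apply the simple proxy to every individual in a genotype table."""
    return {ind: homozygosity_proportion(g) for ind, g in data.items()}


# Hypothetical toy data: three individuals, five variants each.
toy = {
    "ind1": [0, 2, 2, 1, 0],
    "ind2": [1, 1, None, 2, 1],
    "ind3": [None, None, None, None, None],
}
scores = per_individual_homozygosity(toy)
```

A summary of this kind could then enter a downstream association model as a covariate. A real analysis of pedigree data would additionally need to account for relatedness and for local variation along the genome, which this sketch ignores.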

Results

Conclusion