Genetic Classification of Populations Using Supervised Learning

Michael Bridges, Ricardo Segurado, Colm O'Dushlaine, Carlos Pinto, Michael Gill, Elizabeth A Heron, Aiden Corvin, Derek Morris

doi:10.1371/journal.pone.0014802

Open Access

Journal: PLoS ONE
Publication Date: May 12, 2011
Citations: 30
License type: cc-by

Affiliation: Trinity College Dublin

Abstract

There are many instances in genetics in which we wish to determine whether two candidate populations are distinguishable on the basis of their genetic structure. Examples include populations that are geographically separated, case–control studies and quality control (when participants in a study have been genotyped at different laboratories). This latter application is of particular importance in the era of large-scale genome-wide association studies, when collections of individuals genotyped at different locations are being merged to provide increased power. The traditional method for detecting structure within a population is some form of exploratory technique such as principal components analysis. Such methods, which do not utilise our prior knowledge of the membership of the candidate populations, are termed unsupervised. Supervised methods, on the other hand, are able to utilise this prior knowledge when it is available. In this paper we demonstrate that in such cases modern supervised approaches are a more appropriate tool for detecting genetic differences between populations. We apply two such methods (neural networks and support vector machines) to the classification of three populations (two from Scotland and one from Bulgaria). The sensitivity exhibited by both these methods is considerably higher than that attained by principal components analysis and in fact comfortably exceeds a recently conjectured theoretical limit on the sensitivity of unsupervised methods. In particular, our methods can distinguish between the two Scottish populations, where principal components analysis cannot. We suggest, on the basis of our results, that a supervised learning approach should be the method of choice when classifying individuals into pre-defined populations, particularly in quality control for large-scale genome-wide association studies.
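To make the idea of a supervised method concrete, the following is a minimal sketch and not the authors' pipeline. It swaps in a nearest-centroid classifier, a far simpler supervised method than the neural networks and support vector machines used in the paper, and runs it on simulated genotypes. The allele frequencies, sample sizes, SNP count and all function names are hypothetical choices made for illustration. The point it shows is that the known population labels are used during training, and that distinguishability is judged by accuracy on held-out individuals.

```python
import random


def simulate_population(rng, freqs, n_individuals):
    """Draw genotypes (0, 1 or 2 copies of an allele) at each SNP."""
    return [
        [(rng.random() < p) + (rng.random() < p) for p in freqs]
        for _ in range(n_individuals)
    ]


def centroid(genotypes):
    """Mean genotype at each SNP across a list of individuals."""
    n = len(genotypes)
    return [sum(col) / n for col in zip(*genotypes)]


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def predict(individual, centroids):
    """Return the label whose training centroid is closest."""
    return min(centroids,
               key=lambda label: squared_distance(individual, centroids[label]))


rng = random.Random(0)
n_snps = 200

# Hypothetical allele frequencies: population B is shifted by 0.1 at every SNP.
# Real populations would differ far less, and not uniformly.
freqs_a = [rng.uniform(0.2, 0.4) for _ in range(n_snps)]
freqs_b = [p + 0.1 for p in freqs_a]

# Labelled training individuals and separate held-out test individuals.
train = {"A": simulate_population(rng, freqs_a, 100),
         "B": simulate_population(rng, freqs_b, 100)}
test = {"A": simulate_population(rng, freqs_a, 50),
        "B": simulate_population(rng, freqs_b, 50)}

# Training step: the labels decide which individuals feed which centroid.
centroids = {label: centroid(g) for label, g in train.items()}

# Evaluation step: classify individuals the classifier has never seen.
n_test = sum(len(inds) for inds in test.values())
correct = sum(predict(ind, centroids) == label
              for label, inds in test.items() for ind in inds)
accuracy = correct / n_test
print(f"held-out accuracy: {accuracy:.2f}")
```

Held-out accuracy well above 0.5 indicates that the two labelled groups are distinguishable. Accuracy close to 0.5 would indicate that this classifier finds no usable difference between them.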

Highlights

The advent of the new large-scale genotyping and sequencing technologies has resulted in unprecedented quantities of data becoming available to the genetics community.
In view of the fact that other scientific fields have already gone through a similar process of development, it is likely that cross-disciplinary collaborations in data analysis will yield fruitful results in genetics.
We first perform a principal components analysis (PCA) on the three populations to determine whether the populations can be distinguished using an unsupervised learning approach.

Summary

Introduction

The advent of the new large-scale genotyping and sequencing technologies has resulted in unprecedented quantities of data becoming available to the genetics community. In view of the fact that other scientific fields have already gone through a similar process of development, it is likely that cross-disciplinary collaborations in data analysis will yield fruitful results in genetics. We apply machine learning techniques previously used in cosmology to the problem of genetic classification. Such techniques involve the use of automated algorithms to mimic the learning capabilities of animal brains. They have proved extremely useful in the analysis of complex data in many scientific disciplines. To date, geneticists have relied mainly on unsupervised methods, such as principal components analysis (PCA), to classify individuals on the basis of their genetic data.
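For readers unfamiliar with how PCA is applied to genotype data, here is a minimal sketch under stated assumptions and not the authors' analysis. It simulates two hypothetical populations, centres the genotype matrix, and extracts the first principal component by power iteration. The allele frequencies, sample sizes, iteration count and function names are all illustrative assumptions. The population labels play no part in fitting the component and are used only afterwards to summarise the scores, which is what makes the method unsupervised.

```python
import math
import random


def simulate_population(rng, freqs, n_individuals):
    """Draw genotypes (0, 1 or 2 copies of an allele) at each SNP."""
    return [[(rng.random() < p) + (rng.random() < p) for p in freqs]
            for _ in range(n_individuals)]


def centre_columns(rows):
    """Subtract the per-SNP mean so every column has mean zero."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def project(rows, direction):
    """Score of each individual along a direction in SNP space."""
    return [sum(x * w for x, w in zip(row, direction)) for row in rows]


def first_principal_component(rows, rng, n_iter=50):
    """Power iteration on X^T X, where X is the centred genotype matrix."""
    v = [rng.gauss(0.0, 1.0) for _ in range(len(rows[0]))]
    for _ in range(n_iter):
        scores = project(rows, v)                        # X v
        v = [sum(s * x for s, x in zip(scores, col))     # X^T (X v)
             for col in zip(*rows)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v


rng = random.Random(1)
n_snps = 200

# Hypothetical allele frequencies: population B is shifted by 0.1 at every SNP.
freqs_a = [rng.uniform(0.2, 0.4) for _ in range(n_snps)]
freqs_b = [p + 0.1 for p in freqs_a]

# Pool both populations; PCA sees only the genotypes, never the labels.
genotypes = (simulate_population(rng, freqs_a, 100)
             + simulate_population(rng, freqs_b, 100))

centred = centre_columns(genotypes)
pc1 = first_principal_component(centred, rng)
scores = project(centred, pc1)

# Labels are used only here, to summarise the unsupervised result.
mean_a = sum(scores[:100]) / 100
mean_b = sum(scores[100:]) / 100
print(f"mean PC1 score: A = {mean_a:.2f}, B = {mean_b:.2f}")
```

If the two groups are sufficiently diverged, their mean scores on the first component sit clearly apart. When the divergence is small relative to the sampling noise, the leading component no longer reflects the population split, which is the situation in which a supervised method has the advantage.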
