Eye-color and Type-2 diabetes phenotype prediction from genotype data using deep learning methods

Muhammad Muneeb,Andreas Henschel

doi:10.1186/s12859-021-04077-9

Muhammad Muneeb, Andreas Henschel

Open Access

https://doi.org/10.1186/s12859-021-04077-9

Copy DOI

Abstract

BackgroundGenotype–phenotype predictions are of great importance in genetics. These predictions can help to find genetic mutations causing variations in human beings. There are many approaches for finding the association which can be broadly categorized into two classes, statistical techniques, and machine learning. Statistical techniques are good for finding the actual SNPs causing variation where Machine Learning techniques are good where we just want to classify the people into different categories. In this article, we examined the Eye-color and Type-2 diabetes phenotype. The proposed technique is a hybrid approach consisting of some parts from statistical techniques and remaining from Machine learning.ResultsThe main dataset for Eye-color phenotype consists of 806 people. 404 people have Blue-Green eyes where 402 people have Brown eyes. After preprocessing we generated 8 different datasets, containing different numbers of SNPs, using the mutation difference and thresholding at individual SNP. We calculated three types of mutation at each SNP no mutation, partial mutation, and full mutation. After that data is transformed for machine learning algorithms. We used about 9 classifiers, RandomForest, Extreme Gradient boosting, ANN, LSTM, GRU, BILSTM, 1DCNN, ensembles of ANN, and ensembles of LSTM which gave the best accuracy of 0.91, 0.9286, 0.945, 0.94, 0.94, 0.92, 0.95, and 0.96% respectively. Stacked ensembles of LSTM outperformed other algorithms for 1560 SNPs with an overall accuracy of 0.96, AUC = 0.98 for brown eyes, and AUC = 0.97 for Blue-Green eyes. The main dataset for Type-2 diabetes consists of 107 people where 30 people are classified as cases and 74 people as controls. We used different linear threshold to find the optimal number of SNPs for classification. The final model gave an accuracy of 0.97%.ConclusionGenotype–phenotype predictions are very useful especially in forensic. These predictions can help to identify SNP variant association with traits and diseases. Given more datasets, machine learning model predictions can be increased. Moreover, the non-linearity in the Machine learning model and the combination of SNPs Mutations while training the model increases the prediction. We considered binary classification problems but the proposed approach can be extended to multi-class classification.

Highlights

Genotype–phenotype predictions are of great importance in genetics
Why are we all different? Why our physical characteristics differ? The answer to this question lies in genetic variations in human beings [1]
When we applied an algorithm to any dataset, we considered different hyperparameters specific to that algorithm to find the optimal result. This whole pipeline is computationally expensive but reliable because after finding the best model which can be from machine learning or deep learning paradigm, we can use it to classify people based on genotype data into specific phenotype

Summary

Introduction

Genotype–phenotype predictions are of great importance in genetics. These predictions can help to find genetic mutations causing variations in human beings. Statistical techniques are good for finding the actual SNPs causing variation where Machine Learning techniques are good where we just want to classify the people into different categories. All humans are different from each other like our eye color and other physical characteristics. Alleles are forms of minor variations of the same gene in their sequence of DNA bases. These small differences contribute to each person’s distinctive physical characteristics [4]. If the person has two copies of it and does not have the dominant allele of that gene is a recessive allele expressed This pair of alleles is known as a genotype and determines the appearance or phenotype of the organism [5]

Methods

Results

Conclusion