Abstract

Background

Alzheimer’s disease (AD) genetic research using genome‐wide association study data often requires imputation. Genotypes are imputed probabilistically by a hidden Markov model and converted into one‐dimensional categorical variables for a convolutional neural network (CNN). This conversion, however, discards the uncertainty of the imputation, and the accumulated uncertainty across markers can mislead the CNN when it determines AD‐risk factors. The imputed genotypes are also quality‐controlled (QCed) with PLINK, a widely used whole‐genome data analysis toolset. However, the PLINK QC process cannot exclude all incorrectly imputed multiallelic genetic variants (IMGs). To address these issues, we propose a post‐quality‐control algorithm that transforms the genetic dataset into a three‐dimensional one‐hot encoding weighted by genotype posterior probabilities.

Method

We collected data from 1,876, 173, and 384 participants from the Alzheimer’s Disease Neuroimaging Initiative, Imaging and Genetic Biomarkers for AD, and Longitudinal Early‐Onset Alzheimer’s Disease Study, respectively. Genotypes were imputed through Markov Chain Haplotyping and QCed using PLINK (Fig. 1). The data were then transformed into a 5×2×N three‐dimensional one‐hot encoding, where 5 is the number of allele categories (four nucleotides plus one insertion), 2 is the number of genotype copies, and N is the number of all available markers. The one‐hot encoded data were weighted by genotype posterior probabilities obtained with the Genome Analysis Toolkit. The weighting algorithm also re‐imputes IMGs based on their genotype posterior probabilities (Fig. 2).

Result

All genotypes were successfully transformed into one‐hot encodings composed of genotype posterior probabilities. Each participant had about 6,980 IMGs, with a standard deviation of 822 (Fig. 3). The percentage of IMGs for each chromosome was about 0.274%, with a standard deviation of 0.234 (Table 1). All IMGs were corrected based on genotype posterior probabilities and transformed into one‐hot encodings (Fig. 4).

Conclusion

The QC process is essential for minimizing the bias and errors that would otherwise be imposed on computational models. Transforming genetic data into one‐hot encodings weighted by genotype posterior probability presents each genotype together with its imputation quality. Furthermore, it makes fuller use of the CNN’s capacity by allowing it to determine various features.
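The 5×2×N encoding described in the Method can be illustrated with a minimal sketch. The abstract does not give the exact weighting rule, so everything here is an assumption made for illustration: the row order of the five allele channels, the hypothetical function name `encode_weighted`, the biallelic input format (reference allele, alternate allele, and posteriors for genotypes 0/0, 0/1, 1/1), and the choice to place the posterior of the most probable genotype in the corresponding one‐hot slots.

```python
import numpy as np

# Assumed channel order; "I" stands for an insertion. Not taken from the source.
ALLELES = ("A", "C", "G", "T", "I")
ALLELE_INDEX = {allele: i for i, allele in enumerate(ALLELES)}

# Allele carried by (copy 1, copy 2) for each genotype: 0 = reference, 1 = alternate.
GENOTYPES = ((0, 0), (0, 1), (1, 1))


def encode_weighted(markers):
    """Build a 5 x 2 x N array from (ref, alt, posteriors) tuples.

    posteriors = [P(0/0), P(0/1), P(1/1)]. The most probable genotype is
    selected, and its posterior replaces the usual 1 in the one-hot slots
    (an illustrative weighting rule, not necessarily the authors').
    """
    n_markers = len(markers)
    encoding = np.zeros((5, 2, n_markers), dtype=np.float32)
    for j, (ref, alt, posteriors) in enumerate(markers):
        posteriors = np.asarray(posteriors, dtype=np.float32)
        best = int(np.argmax(posteriors))
        for copy, allele_code in enumerate(GENOTYPES[best]):
            allele = ref if allele_code == 0 else alt
            encoding[ALLELE_INDEX[allele], copy, j] = posteriors[best]
    return encoding


# Three made-up markers, purely for demonstration.
markers = [
    ("A", "G", [0.9, 0.1, 0.0]),
    ("C", "T", [0.2, 0.7, 0.1]),
    ("G", "I", [0.0, 0.0, 1.0]),
]
encoded = encode_weighted(markers)
```

Under these assumptions, a confidently imputed genotype yields values near 1 and an uncertain one yields smaller values, so a downstream CNN sees imputation quality alongside the genotype itself.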
