Deep learning has emerged as a promising technique that uses nonlinear transformations to extract features from high-dimensional datasets. However, applying it to genome-wide association studies (GWAS) is challenging because genomic data are so high-dimensional. This study introduces a novel three-step method, termed SWAT-CNN, for the identification of genetic variants. The approach uses deep learning to pinpoint phenotype-related single nucleotide polymorphisms (SNPs), which can then be used to build accurate disease classification models. In the first step, the entire genome is divided into non-overlapping fragments of an optimal size, and a convolutional neural network (CNN) is run on each fragment to select phenotype-associated fragments. The second step applies a Sliding Window Association Test (SWAT), in which a CNN is run on the selected fragments to compute phenotype influence scores (PIS) and to identify phenotype-associated SNPs based on these scores. In the third step, a CNN is run on all identified SNPs to construct a classification model. The approach was validated using GWAS data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), comprising 981 subjects, including cognitively normal older adults (CN) and individuals with Alzheimer's disease (AD). Notably, the method identified the well-known APOE region as the most significant genetic locus for AD. The resulting classification model achieved an area under the curve (AUC) of 0.82, comparable to traditional machine learning approaches such as random forest and XGBoost. SWAT-CNN, a deep learning-based genome-wide method, not only identified AD-associated SNPs but also produced a classification model for AD, suggesting potential applications across diverse biomedical domains.
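The index bookkeeping behind the first two steps can be illustrated with a minimal sketch. It assumes SNPs are addressed by integer position along the genome. The function names, fragment size, window size, and stride are hypothetical and are not taken from the study. CNN training and the computation of phenotype influence scores are deliberately left out, since the abstract does not specify them.

```python
from typing import List, Tuple

Range = Tuple[int, int]  # half-open interval [start, end) of SNP indices


def split_into_fragments(n_snps: int, fragment_size: int) -> List[Range]:
    """Step 1 (sketch): cut n_snps positions into non-overlapping fragments.

    The last fragment may be shorter than fragment_size.
    """
    if fragment_size <= 0:
        raise ValueError("fragment_size must be positive")
    return [
        (start, min(start + fragment_size, n_snps))
        for start in range(0, n_snps, fragment_size)
    ]


def sliding_windows(start: int, end: int, window_size: int, stride: int = 1) -> List[Range]:
    """Step 2 (sketch): overlapping windows lying fully inside [start, end).

    Returns an empty list if the fragment is shorter than one window.
    """
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    last_start = end - window_size
    return [(s, s + window_size) for s in range(start, last_start + 1, stride)]


# Hypothetical toy sizes, chosen only to keep the output readable.
fragments = split_into_fragments(n_snps=10, fragment_size=4)
print(fragments)                                   # [(0, 4), (4, 8), (8, 10)]
print(sliding_windows(4, 8, window_size=2))        # [(4, 6), (5, 7), (6, 8)]
```

In a pipeline of this shape, a model would be fitted once per fragment in the first step and once per window in the second. The stride therefore controls the trade-off between SNP-level resolution and the number of models trained.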