Precise diagnosis of three top cancers using dbGaP data

Xu-Qing Liu,Xin-Sheng Liu,Xiao-Hu Luo,Hong-Yan Jiang,Feng Gao,Xiao-Feng Li,Ye-Qin Chen,Li-Li Xiao,Hai-Wen Chen,Cheng-Yao Ji,Chun-Hua Deng,Jun-Liang Li,Zhi-Guo Zhao,Yan-Dong Wu,Yu-Ting Liu,Yu Huang,Jian-Ying Rong,Wen-Wen Liu

doi:10.1038/s41598-020-80832-x

Open Access

Abstract

This work takes on the challenge of decoding information about complex diseases hidden in the huge number of single nucleotide polymorphism (SNP) genotypes, based on five dbGaP studies. Current genome-wide association studies have successfully identified many high-risk SNPs associated with diseases, but precise diagnostic models for complex diseases based on these or other SNP genotypes are still unavailable in the literature. We report that lung cancer, breast cancer and prostate cancer, the three most common cancers worldwide, can be predicted precisely from 240–370 SNPs, with accuracy up to 99% under leave-one-out and 10-fold cross-validation. Our findings (1) confirm an early conjecture of Dr. Mitchell H. Gail that about 300 SNPs are needed to improve risk forecasts for breast cancer, (2) reveal the striking fact that SNP genotypes may contain almost all the information one wants to know, and (3) show a hopeful possibility that complex diseases can be precisely diagnosed by means of SNP genotypes without using phenotypic features. In short, information hidden in SNP genotypes can be extracted efficiently to make precise diagnoses of complex diseases.

Highlights

The challenge of decoding information about complex diseases hidden in the huge number of single nucleotide polymorphism (SNP) genotypes is undertaken based on five dbGaP studies.
It is of great practical significance to find methods of diagnosing complex diseases precisely based on single nucleotide polymorphism (SNP) genotypes[2,3].
We first use Snp2Bin to transform these SNPs into 656,891 binary variables, apply IterMMPC to reduce attributes and obtain a 3274-variable subset, and employ OptNBC to get a 268-feature NBC (Fig. 2A and Fig. S1A).
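
To make the idea of turning SNP genotypes into binary variables concrete, here is a minimal sketch. It is not the authors' Snp2Bin procedure (which is specified in Algorithm S1 and not reproduced here); it assumes a plain one-hot encoding with one 0/1 column per (SNP, genotype) pair seen in the data. The function name `snps_to_binary`, the SNP identifiers and the genotype strings are all hypothetical.

```python
from typing import Dict, List, Tuple


def snps_to_binary(
    genotypes: List[Dict[str, str]],
) -> Tuple[List[Tuple[str, str]], List[List[int]]]:
    """One-hot encode SNP genotypes into 0/1 indicator columns.

    Assumption: one column per (SNP id, genotype) pair observed in the data.
    This illustrates the general idea only, not the paper's Snp2Bin.
    """
    # Collect every (snp_id, genotype) pair seen in any sample, in sorted order.
    columns = sorted({(snp, g) for sample in genotypes for snp, g in sample.items()})
    # A cell is 1 when the sample carries that genotype at that SNP, else 0.
    matrix = [
        [1 if sample.get(snp) == g else 0 for snp, g in columns]
        for sample in genotypes
    ]
    return columns, matrix


# Toy input: three samples genotyped at two hypothetical SNPs.
samples = [
    {"rs1": "AA", "rs2": "CT"},
    {"rs1": "AG", "rs2": "CC"},
    {"rs1": "AA", "rs2": "CC"},
]
columns, matrix = snps_to_binary(samples)
for row in matrix:
    print(row)
```

Under this assumed encoding, a SNP with several observed genotypes contributes several binary columns, which is one way a set of SNPs can expand into a much larger set of binary variables before attribute reduction.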

Summary

Introduction

The challenge of decoding information about complex diseases hidden in the huge number of single nucleotide polymorphism (SNP) genotypes is undertaken based on five dbGaP studies. High-throughput sequencing technology yields more and more molecular data, and poses the challenge of how to use these rich resources efficiently[1]. Among these challenges, it is of great practical significance to find methods of diagnosing complex diseases precisely based on SNP genotypes[2,3]. This challenge has become a shackle on current genome-wide association (GWA) studies, and it may be time to break it, so that moving beyond the initial steps of GWA studies[4] will no longer be hard work in the near future. We first use Snp2Bin (Fig. 1A and Algorithm S1; a key procedure) to transform SNPs into 2-value variables; then apply IterMMPC (Fig. 1B and Algorithm S2) to reduce attributes; and finally employ OptNBC (Fig. 1C and Algorithm S3) to get the optimal features for a naive Bayes classifier (NBC)[18,19].
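
The last stage of the pipeline above ends in a naive Bayes classifier over 2-value variables, evaluated by cross-validation. The sketch below shows a generic Bernoulli naive Bayes with Laplace smoothing and a leave-one-out accuracy estimate, in plain Python. It is an illustration under stated assumptions, not the authors' OptNBC: it performs no feature optimisation, and the function names, the smoothing constant `alpha` and the toy data are all hypothetical.

```python
import math
from typing import Dict, List, Tuple

Model = Dict[int, Tuple[float, List[float]]]


def train_nbc(X: List[List[int]], y: List[int], alpha: float = 1.0) -> Model:
    """Fit a Bernoulli naive Bayes model (assumed form, Laplace-smoothed)."""
    model: Model = {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        # Smoothed estimate of P(feature_j = 1 | class c) for every feature j.
        probs = [(sum(col) + alpha) / (len(rows) + 2 * alpha) for col in zip(*rows)]
        # Store the log prior of the class together with its feature probabilities.
        model[c] = (math.log(len(rows) / len(y)), probs)
    return model


def predict_nbc(model: Model, x: List[int]) -> int:
    """Return the class with the highest log posterior for binary vector x."""
    def score(c: int) -> float:
        log_prior, probs = model[c]
        return log_prior + sum(
            math.log(p if v else 1.0 - p) for p, v in zip(probs, x)
        )
    return max(model, key=score)


def leave_one_out_accuracy(X: List[List[int]], y: List[int]) -> float:
    """Hold out each sample once, train on the rest, and count correct calls."""
    hits = 0
    for i in range(len(y)):
        model = train_nbc(X[:i] + X[i + 1:], y[:i] + y[i + 1:])
        hits += predict_nbc(model, X[i]) == y[i]
    return hits / len(y)


# Toy binary features (rows = samples) and case/control labels (1 = case).
X = [[1, 0, 1], [1, 0, 0], [1, 1, 1], [0, 1, 0], [0, 1, 1], [0, 0, 0]]
y = [1, 1, 1, 0, 0, 0]
model = train_nbc(X, y)
accuracy = leave_one_out_accuracy(X, y)
print(predict_nbc(model, [1, 0, 1]), accuracy)
```

Leave-one-out retrains the model once per sample, so it is practical only for a classifier that is cheap to fit, which is one reason a naive Bayes model pairs naturally with this kind of evaluation.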

Results

Discussion

Conclusion