Abstract

Background

The Genome Center for Alzheimer’s Disease (GCAD) coordinates the integration and meta-analysis of all available Alzheimer’s disease (AD) relevant whole genome sequencing (WGS) data, with the goal of identifying genetic variants that confer AD risk or protection and, eventually, therapeutic targets. The WGS datasets are generated through a collaboration between scientists from the Alzheimer’s Disease Sequencing Project (ADSP) and GCAD. To minimize the data heterogeneity introduced by different sequencing protocols and machines, GCAD processes all samples with identical pipelines and performs quality assurance (QA) checks.

Methods

Raw sequencing data (FASTQs or BAMs) were aligned to GRCh38/hg38 with BWA, and variant calling and joint genotyping were done with GATK. In addition, Smoove, Manta and Strelka were applied to generate structural variant (SV) calls per sample. QA checks, including sex, contamination and genotype concordance, as well as the ADSP QC protocol, were performed to evaluate the quality of samples and variants. To make the large joint-genotyped VCF files easier to access and use, we introduced a compact version that stores only variant information and sample genotypes.

Results

We dropped 235 (1.3%) samples that had poor coverage (<20x) or failed QA checks, and we flagged 173 (1.0%) samples of borderline quality. As a result, the dataset (ADSP Release 3, 2021) includes 16,905 genomes from 17 diverse cohorts with 3 major ethnicities: 10,651 Non-Hispanic Whites, 3,212 Hispanics and 2,874 African Americans. The genomes are deeply sequenced (average genome coverage: >30x). All samples’ CRAMs, gVCFs from GATK, and VCFs from the three SV callers were deposited into the NIAGADS Data Sharing Service (DSS) (https://dss.niagads.org/) for public distribution. In addition, joint-genotype VCFs are available in both compact and QC versions. The joint-genotype VCF contains >206M bi-allelic single-nucleotide variants, 16M bi-allelic indels and 28M multi-allelic variants, with 96% of variants remaining after stringent QC.

Conclusion

The ADSP and GCAD generate high-quality genotype calls and SV calls. The project is currently processing ~37,000 WGS samples, sequenced primarily through the ADSP Follow-Up Study, which will cover a more ancestrally diverse set of populations. We anticipate that this 2022 release will continue to benefit the research community studying AD genetics.
