Cataloguing experimentally confirmed 80.7 kb-long ACKR1 haplotypes from the 1000 Genomes Project database

Willy Albert Flegel, Bo Lan, Kshitij Srivastava, Anne-Sophie Fratzscher

doi:10.1186/s12859-021-04169-6

Open Access

Abstract

Background: Clinically effective and safe genotyping relies on correct reference sequences, often represented by haplotypes. The 1000 Genomes Project recorded individual genotypes across 26 different populations and, using computerized genotype phasing, reported haplotype data. In contrast, we identified long reference sequences by analyzing the homozygous genomic regions in this online database, a concept that has rarely been reported since next generation sequencing data became available.

Study design and methods: Phased genotype data for an 80.6 kb region of chromosome 1 were downloaded for all 2,504 unrelated individuals of the 1000 Genomes Project Phase 3 cohort. The data were centered on the ACKR1 gene and bordered by the CADM3 and FCER1A genes. Individuals with heterozygosity at a single site or with complete homozygosity allowed unambiguous assignment of an ACKR1 haplotype. A computer algorithm was developed for extracting these haplotypes from the 1000 Genomes Project in an automated fashion. A manual analysis validated the data extracted by the algorithm.

Results: We confirmed 902 ACKR1 haplotypes of varying lengths, the longest at 80,584 nucleotides and the shortest at 1,901 nucleotides. The combined length of haplotype sequences comprised 19,895,388 nucleotides with a median of 16,014 nucleotides. Based on our approach, all haplotypes can be considered experimentally confirmed and not affected by the known errors of computerized genotype phasing.

Conclusions: Tracts of homozygosity can provide definitive reference sequences for any gene. They are particularly useful when observed in unrelated individuals of large scale sequence databases. As a proof of principle, we explored the 1000 Genomes Project database for ACKR1 gene data and mined long haplotypes. These haplotypes are useful for high throughput analysis with next generation sequencing. Our approach is scalable, using automated bioinformatics tools, and can be applied to any gene.
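The assignment rule stated in the abstract (complete homozygosity, or heterozygosity at a single site, fixes both haplotypes without any phasing) can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function name `resolve_haplotypes` and the representation of an individual as a list of unordered allele pairs are assumptions made only for illustration.

```python
from typing import List, Optional, Tuple

# One site = the two alleles observed in a diploid individual.
# The order inside the pair carries no phase information (assumption).
Genotype = Tuple[str, str]


def resolve_haplotypes(genotypes: List[Genotype]) -> Optional[Tuple[str, str]]:
    """Return both haplotypes if they follow unambiguously, else None.

    With zero heterozygous sites the two haplotypes are identical.
    With exactly one heterozygous site they differ only at that site,
    so no phasing decision is needed. With two or more heterozygous
    sites the alleles could be combined in more than one way.
    """
    het_sites = [i for i, (a, b) in enumerate(genotypes) if a != b]
    if len(het_sites) > 1:
        return None  # ambiguous: would require computational or laboratory phasing
    hap1 = "".join(a for a, _ in genotypes)
    hap2 = "".join(b for _, b in genotypes)
    return hap1, hap2


if __name__ == "__main__":
    print(resolve_haplotypes([("A", "A"), ("C", "T"), ("G", "G")]))
    print(resolve_haplotypes([("A", "G"), ("C", "T")]))
```

The sketch treats each allele as a short string, so it would also accept insertion or deletion alleles, although a real pipeline would need to handle those against reference coordinates.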

Highlights

Data generated by next generation sequencing (NGS) are often utilized in the emerging fields of precision and personalized medicine
As a proof of principle, we explored the 1000 Genomes Project database for ACKR1 gene data and mined long haplotypes
These haplotypes are useful for high throughput analysis with next generation sequencing

Summary

Introduction

Data generated by next generation sequencing (NGS) are often utilized in the emerging fields of precision and personalized medicine. Genotype phasing has often been inferred using computational methods [2, 3], which are prone to certain types of error [4]. These errors are encountered in samples harboring novel variants, low frequency or rare variants, and structural variants [5]. Almost all of these errors can be precluded by laboratory based methods, such as sequencing the genomes of both parents and sibling offspring [6], physical separation of homologous chromosomes in diploid cells [7, 8], sequencing in sperm cells [9], allele specific PCR [10], single DNA molecule dilution [11] and single molecule sequencing chemistry [12, 13]. We identified long reference sequences by analyzing the homozygous genomic regions in the 1000 Genomes Project database, a concept that has rarely been reported since next generation sequencing data became available.
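The idea of reading reference sequences from homozygous regions can be sketched as a search for the longest stretch around a position of interest that contains at most one heterozygous site. The following Python sketch is an assumption about how such a tract could be delimited, not a description of the published procedure; the function name `longest_tract`, the `anchor` index, and the `max_het` parameter are hypothetical.

```python
from typing import List, Optional, Tuple


def longest_tract(
    positions: List[int],
    is_het: List[bool],
    anchor: int,
    max_het: int = 1,
) -> Optional[Tuple[int, int]]:
    """Longest window of variant sites that contains the site at index
    `anchor` and at most `max_het` heterozygous sites.

    `positions` are sorted genomic coordinates of the variant sites and
    `is_het[i]` tells whether the individual is heterozygous at site i.
    Returns (start_position, end_position), or None if no window fits.
    """
    best: Optional[Tuple[int, int]] = None
    for lo in range(anchor + 1):
        het = 0
        hi = lo - 1
        # Extend to the right until the heterozygosity budget is exceeded.
        for j in range(lo, len(positions)):
            het += is_het[j]
            if het > max_het:
                break
            hi = j
        if hi < anchor:
            continue  # window ends before reaching the anchor site
        if best is None or positions[hi] - positions[lo] > best[1] - best[0]:
            best = (positions[lo], positions[hi])
    return best


if __name__ == "__main__":
    pos = [100, 200, 300, 400, 500, 600]
    het = [True, False, True, False, False, True]
    print(longest_tract(pos, het, anchor=3))
```

The quadratic scan is kept for readability; for a region with many variant sites a two-pointer sliding window would give the same result in linear time.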

Methods

Results

Conclusion