Abstract
Imputation is a powerful in silico approach for filling in missing values in big datasets. The process requires a reference panel, a collection of data from which the missing information can be extracted and imputed. Haplotype imputation requires ethnicity-matched references; a mismatched reference panel significantly reduces the quality of imputation. However, existing big datasets cover only a small number of ethnicities, so ethnicity-matched references are lacking for many populations in the world, which has hampered haplotype imputation and its downstream applications. To address this issue, several approaches have been proposed and explored, including the mixed reference panel, the internal reference panel, and the genotype-converted reference panel. This review article describes and compares these approaches. Increasing evidence shows that gene activity and function are dictated not by one or two genetic elements but by cis-interactions among multiple elements. Cis-interactions require the interacting elements to be on the same chromosome molecule; haplotype analysis is therefore essential for investigating cis-interactions among multiple genetic variants at different loci, and appears to be especially important for studying common diseases. Haplotype analysis will be valuable in a wide spectrum of applications, from academic research to clinical diagnosis, prevention, treatment, and the pharmaceutical industry.
Highlights
Imputation is a powerful in silico approach for filling in missing values in big datasets
One study showed that imputation accuracy can be as high as 97.8%, but as low as 78.2% when the San population was imputed with a reference panel consisting of the entire CHB+JPT panel of 180 haplotypes [4]
We investigated the reason underlying this observation and found that the sliding windows are usually much smaller than the haplotype stretches between switching errors in the statistical phasing results; because accuracy is relatively high within each stretch of the statistically resolved haplotypes, imputation can extract correct information from each sliding window
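The window-size argument in the last highlight can be illustrated with a small sketch. It is a toy model, not the procedure used in the review: the site count, switch-error positions, and window sizes are hypothetical numbers chosen for illustration, and the function names are invented here. The sketch counts how many sliding windows lie entirely within one stretch between switch errors, under the assumption that a switch at position `p` flips the phase between site `p - 1` and site `p`.

```python
def stretch_boundaries(switch_positions, n_sites):
    """Return (start, end) index pairs of the stretches between switch errors.

    A switch at position p is assumed to flip the phase between
    site p - 1 and site p (hypothetical convention for this sketch).
    """
    cuts = [0] + sorted(switch_positions) + [n_sites]
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if b > a]


def fraction_clean_windows(switch_positions, n_sites, window, step=1):
    """Fraction of sliding windows that contain no switch error inside them."""
    switches = sorted(switch_positions)
    total = 0
    clean = 0
    for start in range(0, n_sites - window + 1, step):
        end = start + window  # the window covers sites [start, end)
        # A switch breaks the window only if it falls strictly inside it.
        broken = any(start < p < end for p in switches)
        total += 1
        if not broken:
            clean += 1
    return clean / total


if __name__ == "__main__":
    # Hypothetical toy data: 1000 sites with three switch errors.
    n_sites = 1000
    switches = [250, 500, 750]
    print(stretch_boundaries(switches, n_sites))
    for window in (10, 100, 400):
        share = fraction_clean_windows(switches, n_sites, window)
        print(f"window={window}: {share:.3f} of windows are switch-free")
```

In this toy setting, windows much shorter than the stretches are almost always free of switch errors, while windows longer than the stretches never are, which is the intuition behind the observation above.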
Summary
A straightforward strategy to expand the haplotype references is to recruit human population samples from a wide range of ethnic diversities and determine their molecular haplotypes. When the sample was ASW, the [YRI+CEU] reference panel performed better than the cosmopolitan reference [YRI+MKK+GIH+MEX+CEU]; interestingly, when the internal reference was involved, the largest cosmopolitan reference panel [ASW+CEU+YRI+MKK+GIH+MEX] performed the worst, whereas a reference panel pooled from the seemingly unrelated cohorts [JPT+CHB] performed the best [24]. It is still unclear how this approach works for the many untested populations and subpopulations. Only cohorts from the ethnic populations that contribute to the admixture of the study population should be included in the pooled reference panel; a cosmopolitan panel does not always compromise the quality of imputation. Another potential limitation is computing speed: the larger the number of ethnicities in the pooled reference panel, the higher the computational burden of using this cosmopolitan panel for imputation in practice. This is an important issue that should not be ignored in the big data era.
More From: Molecular biology (Los Angeles, Calif.)