A machine learning framework for genotyping the structural variations with copy number variant

Tian Zheng, Jiayin Wang, Xin Yi, Xiaoyan Zhu, Zhongmeng Zhao, Xuanping Zhang, Hongle Li

doi:10.1186/s12920-020-00733-w

Abstract

Background: Genotyping of structural variation is an important computational problem in next generation sequence data analysis. However, in cancer genomes, copy number variants (CNVs) often coexist with other types of structural variations, which significantly reduces the accuracy of existing genotyping methods. Bias in sequencing coverage and variant allelic frequency can be observed in a CNV region, which leads genotyping approaches to misinterpret a heterozygote as a homozygote. Furthermore, other data signals, such as split-mapped reads and abnormal reads, are also misjudged because of the CNV. Therefore, genotyping structural variations with CNV is a complicated computational problem that should consider multiple features and their interactions.

Methods: Here we proposed a computational method for genotyping indels in CNV regions, which introduced a machine learning framework to comprehensively incorporate a set of data features and their interactions. We extracted fifteen kinds of classification features as input. Unlike the traditional genotyping problem, here the structure of a variant may fall into one of five types: normal homozygote, homozygous variant, heterozygous variant without CNV, heterozygous variant with a CNV on the mutated haplotype, and heterozygous variant with a CNV on the wild haplotype. The Multiclass Relevance Vector Machine (M-RVM) was used as the machine learning framework, combined with the distribution characteristics of the features.

Results: We applied the proposed method to both simulated and real data, and compared it with existing popular software, including Gindel, Facets and GATK, and with other machine learning cores: Support Vector Machine, Lagrange-SVM with OVO multiple classification, Naïve Bayes and BP Neural Network. The results demonstrated that the proposed method outperforms the others in accuracy, stability and efficiency.

Conclusion: This work shows that the genotyping of structural variations in a CNV region cannot be solved as a traditional genotyping problem. More features should be used to efficiently complete the five-category task. According to the results, the proposed method can be a practical algorithm to correctly genotype structural variations with CNV in next generation sequence data. The source code has been uploaded at https://github.com/TrinaZ/Mixgenotype for academic usage only.
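The allelic-frequency bias described in the Background can be illustrated with a small idealized calculation. The sketch below is not the authors' feature extraction. It assumes a pure sample, a single-copy gain, and no mapping or coverage noise, and the function name `expected_vaf` and the scenario labels are hypothetical. It shows only why a copy gain moves a heterozygote's expected variant allelic frequency away from 0.5 and toward the value expected for a homozygote.

```python
from fractions import Fraction


def expected_vaf(mutated_copies: int, wild_copies: int) -> Fraction:
    """Idealized variant allelic frequency: mutated copies over all copies.

    Assumes reads are sampled in proportion to haplotype copy number
    (an illustrative simplification, not the paper's model).
    """
    total = mutated_copies + wild_copies
    if total == 0:
        raise ValueError("at least one haplotype copy is required")
    return Fraction(mutated_copies, total)


# (mutated copies, wild-type copies) for the five variant structures
# listed in the abstract, assuming a one-copy gain where a CNV is present.
scenarios = {
    "normal homozygote": (0, 2),
    "homozygous variant": (2, 0),
    "heterozygous, no CNV": (1, 1),
    "heterozygous, gain on mutated haplotype": (2, 1),
    "heterozygous, gain on wild haplotype": (1, 2),
}

expected = {name: expected_vaf(m, w) for name, (m, w) in scenarios.items()}

for name, vaf in expected.items():
    print(f"{name}: expected VAF = {float(vaf):.2f}")
```

Under these assumptions the two CNV-bearing heterozygotes sit at 2/3 and 1/3 rather than 1/2, so a caller that thresholds on allelic frequency alone has less margin to separate them from the homozygous classes.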

Highlights

Genotyping of structural variation is an important computational problem in next generation sequence data analysis
Based on the distribution probability of copy number variants (CNVs), we created 60 Type N calls, 80 Type G1 calls, 80 Type G2 candidates, 50 Type G3 candidates and 30 Type G4 calls (heterozygotes in which the CNV occurs on the wild haplotype)
Accuracy is the ratio of the number of samples correctly classified by the classifier to the total number of samples in a given test dataset; a relevance vector is a training point whose parameter is nonzero, and these points reflect the characteristics of the training data, because most parameters of the posterior distribution tend to zero and do not contribute to prediction [29]
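The accuracy definition above can be written out directly. The following is a minimal sketch in plain Python. The label strings reuse the type names from the highlights (N, G1 to G4), while the function name `accuracy` and the toy true and predicted labels are made up for illustration and are not results from the paper.

```python
from collections import Counter


def accuracy(true_labels, predicted_labels):
    """Fraction of samples whose predicted label equals the true label."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    if not true_labels:
        raise ValueError("at least one sample is required")
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)


# Hypothetical toy labels for the five-category task (not data from the paper).
true_labels = ["N", "G1", "G2", "G3", "G4", "G2"]
predicted_labels = ["N", "G1", "G2", "G2", "G4", "G2"]

acc = accuracy(true_labels, predicted_labels)

# Count which (true, predicted) pairs were misclassified.
errors = Counter(
    (t, p) for t, p in zip(true_labels, predicted_labels) if t != p
)

print(f"accuracy = {acc:.3f}")
print(f"misclassified pairs = {dict(errors)}")
```

Because the simulated classes are unequal in size (60, 80, 80, 50 and 30 calls), overall accuracy is best read alongside the per-class error counts, which is why the sketch also tallies the misclassified pairs.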

Summary

Introduction

Genotyping of structural variation is an important computational problem in next generation sequence data analysis. In cancer genomes, copy number variants (CNVs) often coexist with other types of structural variations, which significantly reduces the accuracy of existing genotyping methods. Genotyping structural variations with CNV is a complicated computational problem that should consider multiple features and their interactions. Structural variations (SVs) generally refer to cytogenetically visible and submicroscopic variants, including insertions, deletions, inversions and copy number variants [1, 2]. Genotyping of SVs, also known as genotype analysis, is a technique to determine whether a structural variation is heterozygous or homozygous [3]. Accurate SV genotypes are widely used in downstream analysis, such as imputing genotypes [4], estimating genomic diversity [5] and calculating linkage disequilibrium [6], and in clinical practice, including disease diagnosis [7], treatment management [8] and drug design [9].



Journal: BMC Medical Genomics	Publication Date: Aug 1, 2020
Citations: 1	License type: open-access

Similar Papers

Identification of copy number variations using high density whole-genome SNP markers in Chinese Dongxiang spotted pigs.
Chengbin Wang ... Hao Chen
Asian-Australasian Journal of Animal Sciences | VOL. 32
07 Feb 2019

Copy number variation in human genomes from three major ethno-linguistic groups in Africa
Oscar A Nyangiri ... Enock Matovu
BMC Genomics | VOL. 21
10 Apr 2020

Analysis of copy number variation in dogs implicates genomic structural variation in the development of anterior cruciate ligament rupture.
Emily E Binversie ... Corinne D Engelman
PloS one | VOL. 15
31 Dec 2020

Analysis of copy number variations in the sheep genome using 50K SNP BeadChip array
Jiasen Liu ... Lixin Du
BMC Genomics | VOL. 14
08 Apr 2013
