Comparison of kNN and k-means optimization methods of reference set selection for improved CNV callers performance

Wiktor Kuśmirek, Agnieszka Szmurło, Marek Wiewiórka, Tomasz Gambin, Robert Nowak

doi:10.1186/s12859-019-2889-z


Open Access

https://doi.org/10.1186/s12859-019-2889-z


Abstract

Background: There are over 25 tools dedicated to the detection of Copy Number Variants (CNVs) using Whole Exome Sequencing (WES) data based on read depth analysis. The reported tools consist of several steps, including: (i) calculation of read depth for each sequencing target, (ii) normalization, (iii) segmentation and (iv) actual CNV calling. The essential aspect of the entire process is the normalization stage, in which systematic errors and biases are removed and the reference sample set is used to increase the signal-to-noise ratio. Although some CNV calling tools use dedicated algorithms to obtain the optimal reference sample set, most of the advanced CNV callers do not include this feature. To our knowledge, this work is the first attempt to assess the impact of reference sample set selection on CNV detection performance.

Methods: We used WES data from the 1000 Genomes project to evaluate the impact of various methods of reference sample set selection on the CNV calling performance of three chosen state-of-the-art tools: CODEX, CNVkit and exomeCopy. Two naive solutions (all samples as the reference set and random selection) as well as two clustering methods (k-means and k nearest neighbours (kNN), with a variable number of clusters or group sizes) were evaluated to discover the best performing sample selection method.

Results and Conclusions: The performed experiments have shown that the appropriate selection of the reference sample set may greatly improve the CNV detection rate. In particular, we found that smart reduction of the reference sample size may significantly increase the algorithms’ precision while having a negligible negative effect on sensitivity. We observed that a complete CNV calling process with the k-means algorithm as the selection method has significantly better time complexity than the kNN-based solution.
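The two clustering-based selection strategies named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the input layout (a samples-by-targets matrix of read depths), the Euclidean distance, and the function names `knn_reference_sets` and `kmeans_reference_sets` are all assumptions made for illustration.

```python
import numpy as np


def knn_reference_sets(depth, k):
    """For each sample (row), return the indices of its k most similar samples.

    Assumption: similarity is Euclidean distance between depth profiles.
    """
    depth = np.asarray(depth, dtype=float)
    # Pairwise distances between all per-sample depth profiles.
    diff = depth[:, None, :] - depth[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # A sample must never serve as its own reference.
    np.fill_diagonal(dist, np.inf)
    return np.argsort(dist, axis=1)[:, :k]


def kmeans_reference_sets(depth, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per sample.

    Under this scheme, the reference set of a sample is the other
    members of its cluster.
    """
    depth = np.asarray(depth, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialise the centres with randomly chosen samples.
    centers = depth[rng.choice(len(depth), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each sample joins its nearest centre.
        dist = np.linalg.norm(depth[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Update step: move each centre to the mean of its members.
        for c in range(n_clusters):
            members = depth[labels == c]
            if len(members):
                centers[c] = members.mean(axis=0)
    return labels
```

In this sketch kNN builds a separate reference set for every sample, whereas k-means partitions the cohort once so that all samples in a cluster share the same group; this is one plausible reading of why the abstract reports a lower overall running time for the k-means-based pipeline.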

Highlights

There are over 25 tools dedicated to the detection of Copy Number Variants (CNVs) using Whole Exome Sequencing (WES) data based on read depth analysis
Performance evaluation: We evaluated the quality of each pair of (i) reference set selection algorithm and (ii) CNV calling tool by comparing the CNV call set produced by the solution with the golden-record CNV call set provided by the 1000 Genomes Consortium [9], which was generated from Whole Genome Sequencing (WGS) data
We have shown that proper reference sample set selection leads to improved sensitivity and precision for all considered CNV callers
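A comparison of a caller's output with a golden-record call set, as described in the highlights, might be scored along the following lines. This is a hedged sketch: the tuple layout `(chrom, start, end, type)`, the 50% reciprocal-overlap matching rule, and the function names are assumptions, not the criteria stated in the paper.

```python
def reciprocal_overlap(a, b):
    """Reciprocal overlap of two calls given as (chrom, start, end, type).

    Returns the shared length divided by the longer call's length,
    or 0.0 when chromosome or CNV type differ.
    """
    if a[0] != b[0] or a[3] != b[3]:
        return 0.0
    shared = min(a[2], b[2]) - max(a[1], b[1])
    if shared <= 0:
        return 0.0
    return shared / max(a[2] - a[1], b[2] - b[1])


def precision_sensitivity(calls, truth, min_overlap=0.5):
    """Precision and sensitivity of `calls` against the golden record `truth`.

    Assumption: a call is correct if it reciprocally overlaps any
    golden-record CNV by at least `min_overlap`.
    """
    matched_calls = sum(
        any(reciprocal_overlap(c, t) >= min_overlap for t in truth) for c in calls
    )
    matched_truth = sum(
        any(reciprocal_overlap(c, t) >= min_overlap for c in calls) for t in truth
    )
    precision = matched_calls / len(calls) if calls else 0.0
    sensitivity = matched_truth / len(truth) if truth else 0.0
    return precision, sensitivity
```

Precision here is the fraction of reported calls confirmed by the golden record, and sensitivity is the fraction of golden-record CNVs recovered by the caller.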

Summary

Introduction

There are over 25 tools dedicated to the detection of Copy Number Variants (CNVs) using Whole Exome Sequencing (WES) data based on read depth analysis. The reported tools consist of several steps, including: (i) calculation of read depth for each sequencing target, (ii) normalization, (iii) segmentation and (iv) actual CNV calling. Although some CNV calling tools use dedicated algorithms to obtain the optimal reference sample set, most of the advanced CNV callers do not include this feature. To minimize the effect of technological biases, CNV calling algorithms are required to take into account the depth of coverage in other samples (the reference sample set) and the influence of known sources of noise, including but not limited to read mappability and GC content in target regions. Segmentation and actual CNV calling are then applied, producing a set of putative deletions and duplications.
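The role of the reference sample set in the normalization step can be sketched as a per-target log2 ratio against a reference baseline. This is a deliberately simplified illustration, not the model used by CODEX, CNVkit or exomeCopy: the mean-depth scaling, the median baseline, the `eps` stabiliser and the function name `log2_ratio` are assumptions, and corrections for GC content and mappability are omitted.

```python
import numpy as np


def log2_ratio(sample_depth, reference_depths, eps=0.01):
    """Per-target log2 ratio of one sample against a reference sample set.

    sample_depth:     read depth per target, shape (n_targets,)
    reference_depths: read depths of the reference samples,
                      shape (n_reference_samples, n_targets)
    """
    sample = np.asarray(sample_depth, dtype=float)
    ref = np.asarray(reference_depths, dtype=float)
    # Remove differences in overall sequencing depth between samples.
    sample_scaled = sample / sample.mean()
    ref_scaled = ref / ref.mean(axis=1, keepdims=True)
    # The baseline for each target is the median over the reference set.
    baseline = np.median(ref_scaled, axis=0)
    # eps avoids division by zero and log of zero on uncovered targets.
    return np.log2((sample_scaled + eps) / (baseline + eps))
```

Targets with a clearly negative ratio are candidate deletions and targets with a clearly positive ratio are candidate duplications; a poorly matched reference set widens the spread of these ratios, which is the effect that reference set selection aims to reduce.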

Methods

Results

Discussion

Conclusion


Journal: BMC Bioinformatics	Publication Date: May 28, 2019
Citations: 13	License type: open-access


