The introduction of massively parallel sequencing was a turning point in cancer research: it greatly reduced the cost and time of genome sequencing and widened the scope for detecting and analyzing the role of structural alterations in cancer. However, next-generation sequencing (NGS) approaches carry biases that degrade copy number variation (CNV) identification. DNA repeats within CNV regions need special attention, because they degrade the performance of most existing CNV detection tools even after a generalized bias correction method has been applied. This motivated the present work, which proposes a novel method to address DNA repeats, and thereby the mappability bias, in CNV regions. The method consists of three phases. The first phase computes the alignment information of uniquely mapped DNA reads, considering base quality and base mismatch parameters at nucleotide-level precision. The second and third phases use a novel approach to allocate the non-uniquely mapped reads to an optimal region of the DNA repeats based on a probabilistic membership model. The proposed method can identify CNVs in both coding and non-coding regions of the DNA, including CNVs in DNA repeat regions. It achieves a sensitivity greater than [Formula: see text] in simulations. On real data, the detected variants are validated against the Database of Genomic Variants, with a percentage overlap also greater than 95%, and breakpoint prediction is much better than that of other popular bias-correcting CNV detection methods.
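The abstract does not specify the probabilistic membership model, so the following is only an illustrative sketch of the general idea of allocating a non-uniquely mapped read among candidate repeat loci. It assumes, hypothetically, that the membership weight of a locus is the product of its unique-read depth (plus a pseudocount) and a likelihood derived from base quality and mismatches. The function names, the `pseudocount` parameter, and the weighting scheme are assumptions for illustration and are not taken from the paper.

```python
from typing import Dict, List, Tuple


def phred_to_error(q: int) -> float:
    """Convert a Phred quality score to a base-call error probability."""
    return 10 ** (-q / 10)


def alignment_likelihood(quals: List[int], mismatches: List[bool]) -> float:
    """Likelihood of the read's bases if it truly originates from this locus.

    A matching base contributes (1 - e); a mismatching base contributes e / 3,
    where e is the error probability implied by the base quality.
    """
    lik = 1.0
    for q, is_mismatch in zip(quals, mismatches):
        e = phred_to_error(q)
        lik *= (e / 3) if is_mismatch else (1 - e)
    return lik


def membership(
    candidates: Dict[str, Tuple[float, List[bool]]],
    quals: List[int],
    pseudocount: float = 1.0,
) -> Dict[str, float]:
    """Return a normalized membership probability for each candidate locus.

    candidates maps a locus name to (unique-read depth, per-base mismatch flags).
    The weighting used here is an assumed model, not the paper's.
    """
    weights = {
        locus: (depth + pseudocount) * alignment_likelihood(quals, mm)
        for locus, (depth, mm) in candidates.items()
    }
    total = sum(weights.values())
    if total == 0:
        # Degenerate case: fall back to a uniform allocation.
        return {locus: 1 / len(weights) for locus in weights}
    return {locus: w / total for locus, w in weights.items()}


def allocate(probs: Dict[str, float]) -> str:
    """Assign the read to the locus with the highest membership probability."""
    return max(probs, key=probs.get)


# Hypothetical read of 4 bases that maps equally well to two repeat copies.
quals = [30, 30, 30, 30]
candidates = {
    "repeat_A": (30.0, [False, False, False, False]),
    "repeat_B": (10.0, [False, False, False, False]),
}
probs = membership(candidates, quals)
best = allocate(probs)
```

In this toy case both alignments are perfect, so the allocation is driven only by the unique-read depth and the read goes to `repeat_A`. A real implementation would need to handle long reads in log space to avoid numerical underflow.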