Abstract

Tandem duplication (TD) is an important type of structural variation (SV) in the human genome and has biological significance for human cancer evolution and tumorigenesis. Accurate and reliable detection of TDs plays an important role in advancing the early detection, diagnosis, and treatment of disease. The advent of next-generation sequencing technologies has made the study of TDs possible. However, detection remains challenging because of the uneven distribution of reads and the uncertain amplitude of TD regions. In this paper, we present a new method, DINTD (Detection and INference of Tandem Duplications), to detect and infer TDs from short sequencing reads. DINTD first extracts read depth and mapping quality signals, then uses the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm to find candidate TD regions. A total variation penalized least squares model is fitted to the read depth and mapping quality signals to denoise them, and a 2D binary search tree is used to search for neighboring points efficiently. To identify the exact breakpoints of the TD regions, split-read signals are integrated into DINTD. Experiments on simulated data sets showed that DINTD can outperform other methods in sensitivity, precision, F1-score, and boundary bias. DINTD is further validated on real samples, where its results are consistent with those of other methods. This study indicates that DINTD can be used as an effective tool for detecting TDs.
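
As a rough illustration of the clustering step described in the abstract, the following is a minimal sketch and not the DINTD implementation. It assumes each genomic bin is represented as a 2D point (normalized read depth, normalized mapping quality), runs a plain DBSCAN over those points, and flags bins that DBSCAN marks as noise and whose depth exceeds the median depth. The function names (`region_query`, `dbscan`, `candidate_td_bins`), the parameters `eps` and `min_pts`, the normalization, and the "noise plus elevated depth" rule are all assumptions made for this example. Neighbor search is brute force here for brevity, where the abstract mentions a 2D binary search tree.

```python
from statistics import median


def region_query(points, i, eps):
    """Return indices of all points within distance eps of points[i] (brute force)."""
    xi, yi = points[i]
    return [j for j, (x, y) in enumerate(points)
            if (x - xi) ** 2 + (y - yi) ** 2 <= eps ** 2]


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE = -1
    labels = [None] * len(points)  # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


def candidate_td_bins(depth, mapq, eps=0.2, min_pts=3):
    """Hypothetical rule: noise bins with above-median depth are TD candidates."""
    points = list(zip(depth, mapq))
    labels = dbscan(points, eps, min_pts)
    baseline = median(depth)
    return [i for i, lab in enumerate(labels)
            if lab == -1 and depth[i] > baseline]


# Toy input: ten bins, two of which (indices 5 and 6) have roughly doubled depth.
depth = [1.0, 1.05, 0.95, 1.02, 0.98, 2.1, 1.8, 1.0, 0.97, 1.03]
mapq = [1.0, 0.98, 1.0, 0.99, 1.0, 0.97, 0.98, 1.0, 1.0, 0.99]
print(candidate_td_bins(depth, mapq, eps=0.2, min_pts=3))
```

In this toy input the duplicated bins are too few to form their own dense cluster, so they surface as noise. A long duplicated region would instead form a separate cluster and would need a different selection rule, which this sketch does not attempt.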

Highlights

  • Genome structural variations (SVs) are polymorphic rearrangements of 50 base pairs or greater in length and affect about 0.5% of the genome of a given individual (Eichler, 2012)

  • The DINTD software is implemented in Python based on the methods described in the paper, and the code is publicly available at https://github.com/SVanalysis/DINTD

  • Since there is no single answer in real samples, the overlapping density score (Yuan et al, 2018) for the results among the methods is analyzed to show the reliability of DINTD

Introduction

Genome structural variations (SVs) are polymorphic rearrangements of 50 base pairs or greater in length and affect about 0.5% of the genome of a given individual (Eichler, 2012). SVs include deletions, insertions, duplications, inversions, and translocations of segments of DNA (Balachandran and Beck, 2020). Tandem duplication (TD) is defined as a structural rearrangement in which a segment of DNA is duplicated and inserted directly adjacent to the original segment (Olivier et al, 2003). Whole-genome sequencing (WGS) data from tumors have revealed that massive rearrangements, as in the tandem duplicator phenotype, are a specific cancer phenotype (Inaki and Liu, 2012). TDs commonly occur in some cancers (Stephens et al, 2009), for example in ovarian and breast cancer genomes. A subset of ovarian cancer samples shares a marked TD phenotype with triple-negative breast cancers (Mcbride et al, 2012). TDs play an important role in the mechanisms of human disease, and their detection has great significance for genome analysis and the study of human evolution.
