A Study of Cell-free DNA Fragmentation Pattern and Its Application in DNA Sample Type Classification.

Shifu Chen, Xiaoni Zhang, Ming Liu, Mingyan Xu, Yixing Wang, Jia Gu, Shiwei Zhang, Yue Han, Renwen Long

doi:10.1109/tcbb.2017.2723388

Abstract

Plasma cell-free DNA (cfDNA) has characteristic fragmentation patterns, which produce non-random base content curves in the first cycles of the sequencing data. We studied these patterns and found that we could determine whether a sample is cfDNA just by examining the first 10 cycles of its base content curves. We analysed 3189 FastQ files, including 1442 cfDNA, 1234 genomic DNA, 507 FFPE tumour DNA and 6 urinary cfDNA samples. Through in-depth analysis of these data, we found that the patterns are stable enough to distinguish cfDNA from other kinds of DNA samples. Based on this finding, we built classification models to recognise cfDNA samples from their sequencing data. Pattern recognition models were trained with different classification algorithms: k-nearest neighbours (KNN), random forest and support vector machine (SVM). The results of 1000 iterations of .632+ bootstrapping show that all of these classifiers achieve an average accuracy higher than 98%, indicating that the cfDNA patterns are distinctive and make the dataset highly separable. The best result was obtained with the random forest classifier, which reached an average accuracy of 99.89% (σ = 0.00068). A tool called CfdnaPattern (http://github.com/OpenGene/CfdnaPattern) has been developed to train the model and to predict whether a sample is cfDNA.
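
To make the idea of "base content curves of the first 10 cycles" concrete, the following is a minimal sketch of how such a feature vector might be computed from a FastQ stream. It is an illustration only, not the CfdnaPattern implementation: the function name base_content_features, the A/C/G/T feature ordering, and the assumption of unwrapped four-line FastQ records are all choices made for this example.

```python
from collections import Counter
from io import StringIO

BASES = "ACGT"  # feature order per cycle (an assumption of this sketch)


def base_content_features(fastq_handle, cycles=10):
    """Return per-cycle base fractions for the first `cycles` cycles.

    The result is a flat list of length cycles * 4, holding the fractions
    of A, C, G and T at cycle 1, then at cycle 2, and so on.
    Assumes plain four-line FastQ records (sequence lines are not wrapped).
    """
    counts = [Counter() for _ in range(cycles)]
    for line_no, line in enumerate(fastq_handle):
        if line_no % 4 != 1:  # the sequence is the 2nd line of each record
            continue
        seq = line.strip().upper()
        for cycle, base in enumerate(seq[:cycles]):
            if base in BASES:  # skip N and other ambiguity codes
                counts[cycle][base] += 1

    features = []
    for cycle_counts in counts:
        total = sum(cycle_counts.values())
        features.extend(
            cycle_counts[b] / total if total else 0.0 for b in BASES
        )
    return features


# Tiny in-memory example with two made-up reads.
demo = StringIO(
    "@r1\nACGTACGTACGT\n+\nIIIIIIIIIIII\n"
    "@r2\nAAGTACGTACGA\n+\nIIIIIIIIIIII\n"
)
features = base_content_features(demo)
```

One such 40-value vector per sample could then serve as the input to a general-purpose classifier (for example a random forest), which is the kind of model the abstract describes.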

Full Text