Abstract

Plasma cell-free DNA (cfDNA) has characteristic fragmentation patterns, which produce non-random base content curves in the first cycles of the sequencing data. We studied these patterns and found that we could determine whether a sample is cfDNA by examining only the first 10 cycles of its base content curves. We analysed 3189 FastQ files, including 1442 cfDNA, 1234 genomic DNA, 507 FFPE tumour DNA and 6 urinary cfDNA. In-depth analysis of these data showed that the patterns are stable enough to distinguish cfDNA from other kinds of DNA samples. Based on this finding, we built classification models that recognise cfDNA samples from their sequencing data, training them with several classification algorithms such as k-nearest neighbours (KNN), random forest and support vector machine (SVM). Results from 1000 iterations of .632+ bootstrapping show that all of these classifiers achieve an average accuracy above 98%, indicating that the cfDNA patterns are distinctive and make the dataset highly separable. The best result was obtained with the random forest classifier, with an average accuracy of 99.89% (σ = 0.00068). A tool called CfdnaPattern (http://github.com/OpenGene/CfdnaPattern) has been developed to train the model and to predict whether a sample is cfDNA.
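
To make the idea of "base content curves of the first 10 cycles" concrete, the following is a minimal Python sketch of how such curves might be computed from a FastQ stream and flattened into a feature vector for a classifier. It is not taken from CfdnaPattern: the function names `base_content_curves` and `feature_vector`, the A/C/G/T feature ordering, and the assumption of strict four-line FastQ records are all illustrative choices.

```python
import io

BASES = "ACGT"

def base_content_curves(fastq_handle, cycles=10):
    """Return {base: [fraction at cycle 1..cycles]} for a FastQ stream.

    Assumes plain four-line FastQ records (hypothetical simplification).
    """
    counts = [{b: 0 for b in BASES} for _ in range(cycles)]
    for i, line in enumerate(fastq_handle):
        if i % 4 != 1:  # the sequence is the 2nd line of each record
            continue
        seq = line.strip().upper()
        for pos, base in enumerate(seq[:cycles]):
            if base in counts[pos]:  # ignore N and other ambiguity codes
                counts[pos][base] += 1
    curves = {b: [] for b in BASES}
    for pos_counts in counts:
        total = sum(pos_counts.values())
        for b in BASES:
            curves[b].append(pos_counts[b] / total if total else 0.0)
    return curves

def feature_vector(curves):
    """Flatten the four curves into one list (A, then C, then G, then T)."""
    return [frac for b in BASES for frac in curves[b]]

# Tiny in-memory example with two reads (illustrative data only).
demo = io.StringIO(
    "@r1\nACGTACGTAC\n+\nIIIIIIIIII\n"
    "@r2\nAAGTACGTAC\n+\nIIIIIIIIII\n"
)
curves = base_content_curves(demo)
features = feature_vector(curves)
```

With 4 bases and 10 cycles, each file yields a 40-value vector; vectors of this kind, one per FastQ file, could then be passed to a KNN, random forest or SVM implementation of the reader's choice.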
