Abstract
Next-generation sequencing (NGS) of mitochondrial DNA (mtDNA) is widely used in aging and cancer studies. However, cross-contamination of mtDNA is a major concern. Previous methods for detecting mtDNA contamination focus mainly on haplogroup-level phylogeny and neglect haplotype-level differences, which limits their sensitivity and accuracy. Here we present mitoDataclean, a random-forest-based machine learning package for accurate identification of cross-contamination, evaluation of contamination levels and detection of contamination-derived variants in mtDNA NGS data. Comprehensive optimization of mitoDataclean revealed that training on simulated mixtures with small haplogroup distance and low polymorphic difference was critical for optimal modeling. Compared to existing methods, mitoDataclean exhibited significantly improved sensitivity and accuracy for the detection of sample contamination in simulated data. In addition, mitoDataclean achieved area under the curve values of 0.91 and 0.97 for distinguishing genuine from contamination-derived mtDNA variants in a simulated Western dataset and in private sequencing contamination data, respectively, suggesting that the tool may be applicable to different populations and to samples with different sources of contamination. Finally, mitoDataclean was evaluated in several private and public datasets and showed a robust ability to detect contamination. Altogether, our study demonstrates that mitoDataclean may be used for accurate detection of contaminated samples and contamination-derived variants in mtDNA NGS data.
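To illustrate why mixtures with low polymorphic difference are the hard case, the following minimal sketch computes the expected variant allele fraction at each site when one haplotype is contaminated by another. It is not mitoDataclean's simulation procedure; the function name `mix_haplotypes`, the variant labels and the 5% contamination level are all hypothetical and chosen only for illustration.

```python
def mix_haplotypes(major, minor, contamination):
    """Expected variant allele fraction (VAF) per site for a sample carrying
    the `major` haplotype contaminated by `minor` at the given fraction.

    Haplotypes are modeled as sets of variant labels (an assumption made for
    this sketch, not the package's data model).
    """
    if not 0.0 <= contamination <= 1.0:
        raise ValueError("contamination must be in [0, 1]")
    fractions = {}
    for site in sorted(major | minor):
        vaf = 0.0
        if site in major:
            vaf += 1.0 - contamination  # reads from the true sample
        if site in minor:
            vaf += contamination        # reads from the contaminant
        fractions[site] = vaf
    return fractions


# Hypothetical variant sets: two closely related haplotypes that differ
# at only two sites.
major = {"site_A", "site_B", "site_C"}
minor = {"site_A", "site_B", "site_D"}

mixed = mix_haplotypes(major, minor, contamination=0.05)
differing_sites = major ^ minor  # sites that carry any contamination signal

for site, vaf in mixed.items():
    print(f"{site}: expected VAF = {vaf:.2f}")
```

Under these assumptions, only the sites in `differing_sites` deviate from a fraction of 1, so the fewer sites two haplotypes differ at, the less signal a classifier has to work with.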