Abstract

Advances in high-throughput sequencing technologies continue to generate large amounts of sequencing data, enabling the holistic investigation of complex biological phenomena. Genomic sequence data are used for a wide range of applications, such as gene annotation, expression studies, personalized treatment, and precision medicine. However, this rapid expansion in available sequence data poses a tremendous computational challenge, calling for novel data processing and analytic methods, as well as computing resources that match the volume of these datasets. In this work, a machine learning and statistical learning approach for classification based on k-mer representations of DNA sequence data is proposed. While targeted sequencing focuses on a specific region of interest, whole genome sequencing provides a view of a species’ entire genome. Thus, the approach is tested using whole genome sequences of Mycobacterium tuberculosis isolates to (i) reduce the size of genomic sequence data, (ii) identify an optimum k-mer size and use it to build classification models, and (iii) predict the phenotype of a given bacterial isolate from its whole genome sequence data. Furthermore, the computing challenges associated with producing interpretable and explainable insights from whole genome sequence data analyses are described. Classification models were trained using 104 Mycobacterium tuberculosis isolates. Cluster analyses showed that k-mers can be used to discriminate phenotypes, and that the discrimination becomes more distinct as the k-mer size increases. The best performing classification model used a k-mer size of 10 (the longest k-mer considered in this study) and achieved an accuracy, recall, precision, specificity, and Matthews correlation coefficient of 72.0%, 80.5%, 80.5%, 63.6%, and 0.4, respectively.
This study provides a comprehensive approach for resampling whole genome sequencing data, objectively selecting a k-mer size, and performing classification for phenotype prediction. The analysis also shows the importance of increasing the k-mer size to produce more biologically explainable results, and highlights the interplay between accuracy, computing resources such as processing and memory, and the explainability of classification results. Furthermore, the analysis provides a new way to extract genetic information from genomic data and to identify phenotype relationships, which are integral to explaining complex biological mechanisms.
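To make the k-mer representation and its resource cost concrete, here is a minimal sketch of overlapping k-mer counting on a toy sequence. The abstract does not describe the counting procedure, so the choices here (overlapping windows, skipping windows that contain ambiguous bases, no reverse-complement merging) are assumptions, and `count_kmers` and `feature_space_size` are hypothetical helper names.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count overlapping k-mers, skipping windows with non-ACGT characters."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    sequence = sequence.upper()
    valid = set("ACGT")
    counts = Counter()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if set(kmer) <= valid:  # assumption: drop windows with ambiguous bases
            counts[kmer] += 1
    return counts


def feature_space_size(k):
    """Number of distinct DNA k-mers possible over the alphabet {A, C, G, T}."""
    return 4 ** k


# Toy sequence for illustration only (not data from the study).
toy = "ACGTACGTNACG"
counts = count_kmers(toy, 3)
print(counts.most_common())

# The number of possible features grows exponentially with k.
for k in (4, 6, 8, 10):
    print(k, feature_space_size(k))
```

Because the number of possible k-mers is 4^k, a k-mer size of 10 already allows 1,048,576 distinct features per isolate, which illustrates why larger k-mers demand more memory and processing.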
