Bioinformatics Metadata Extraction for Machine Learning Analysis

Zachary Tom

doi:10.31979/etd.s3ta-b264

Abstract

Next-generation sequencing (NGS) has revolutionized the biological sciences. Today, entire genomes can be rapidly sequenced, enabling advances in personalized medicine, the study of genetic diseases, and more. The National Center for Biotechnology Information (NCBI) hosts the Sequence Read Archive (SRA), which contains vast amounts of valuable NGS data. Recent research has shown that sequencing errors in conventional NGS workflows are key confounding factors in detecting mutations. Steps such as sample handling and library preparation can introduce artifacts that reduce the accuracy of calling rare mutations. Thus, there is a need for more insight into the exact relationship between the steps of the NGS workflow (the metadata) and sequencing artifacts. This paper presents a new tool called SRAMetadataX that enables researchers to easily extract crucial metadata from SRA submissions. The tool was used to identify eight sequencing runs that utilized hybrid capture or PCR for enrichment. A bioinformatics pipeline was then built that identified 298,936 potential sequencing artifacts in those runs. Various machine learning models were trained on the data, and the results showed that the models could predict the enrichment method with about 70% accuracy, indicating that different enrichment methods likely produce specific sequencing artifacts.

Full Text