Towards data cleaning in large public biological databases

Hamid Bagheri

doi:10.31274/etd-20210609-9

Abstract

As the cost of sequencing decreases, the amount of data being deposited into public repositories is increasing rapidly. As sequencing data continues to accumulate in online repositories, scientists can increasingly use multi-tiered data to better answer biological questions. One main challenge for public biological repositories is the quality of their metadata. Unfortunately, most public databases do not have methods for identifying errors in their metadata, leading to the potential for error propagation. Cleaning at this scale requires scalable infrastructure and algorithms. In this dissertation, we built a domain-specific language and large-scale infrastructure, called BoaG, to analyze the wealth of genomics data. We used BoaG's interface to reason about the provenance, frequencies, and quality of annotations. The second part of the dissertation focuses on cleaning the public repositories at scale. Databases such as the non-redundant (NR) database rely on user input, so errors in the submitted metadata go unchecked. Previous research analyzed misclassification based on sequence similarity in a small subset of the NR database. To the best of our knowledge, the amount of taxonomic misclassification in the entire database has not been quantified. We proposed and developed an automatic approach to detect and remove suspicious taxonomic assignments and mispredicted functional annotations. We also addressed the widely used sequence clustering information in public databases. Clusters have proven useful for functional annotation, family classification, systems biology, structural genomics, and phylogenetic analysis [73]. We utilized CD-HIT [33] to cluster NR sequences at different similarity levels: 95%, 90%, 85%, and so on down to 65%.
To improve the data quality of the clusters, we removed anomalies and then provided a confidence score based on the lineage of all sequences within each cluster. For the functional annotations, we utilized the Protein Ontology (PRO) [58] and the Gene Ontology [11], both knowledge-based graphs, to detect potentially mispredicted functions. Ontologies have long been used to express knowledge; in this dissertation, we leveraged them to improve the quality of the public genomics databases. We proposed a computational method that abstracts ontology graphs into a lower-dimensional network representation, which makes it easier to reason about inconsistencies among a list of functional annotations. We found that the BoaG infrastructure provided fewer lines of code, reduced storage size, and

Full Text