Abstract

Public repositories of large-scale omics datasets are a valuable resource for researchers: re-analyzing these data can answer novel questions or provide critical data that complement in-house experiments. However, despite the development of standards for compiling metadata, identifying and organizing samples remains a major bottleneck that hampers data reuse. We introduce Onassis, an R package within the Bioconductor environment that provides key functionalities of Natural Language Processing (NLP) tools. Leveraging biomedical ontologies, Onassis greatly simplifies the association of samples from large-scale repositories with ontology-based annotations. Moreover, through semantic similarity measures, Onassis hierarchically organizes the datasets of interest, thus supporting semantically aware analysis of the corresponding omics data. In conclusion, Onassis leverages NLP techniques, biomedical ontologies, and the R statistical framework to identify, relate, and analyze datasets from public repositories. The tool was tested on various large-scale datasets, including compendia of gene expression, histone marks, and DNA methylation, illustrating how it can facilitate the integrative analysis of various omics data.

Highlights

  • The plummeting cost of high-throughput sequencing experiments has led to a rapid accumulation of omics datasets in public repositories

  • The use of biomedical ontologies is typically restricted to the computer science domain; with the exception of the popular Gene Ontology, they rarely reach the community of biologists, which would greatly benefit from their support

  • Through a process known as named entity recognition, Onassis associates free-text descriptions of publicly available samples with concepts from ontologies, in which the entities of a given domain of interest have a standard representation


Summary

Onassis Description

Onassis is available as a package within the R/Bioconductor project[14], a very popular software repository for the analysis of genomic data, used by both bioinformaticians and biologists. Once semantic information has been associated with the samples (based, for example, on annotating sample metadata with cell lines and disease conditions), Onassis uses it within the compare function to direct the analysis of the actual omics data (Fig. 1). This requires that the omics data be stored in a score matrix whose rows represent genomic units and whose columns represent samples. The following use cases illustrate these analyses in detail.
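To make the score-matrix layout concrete, the following is a minimal sketch in plain Python, not the Onassis R API and not the implementation of its compare function. The sample names, gene names, scores, annotation labels, and the helpers `group_columns` and `group_means` are all hypothetical and chosen only for illustration. It shows how semantic annotations attached to samples can partition the columns of a score matrix, so that each genomic unit can be summarized per semantic class.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical score matrix: rows are genomic units, columns are samples.
samples = ["s1", "s2", "s3", "s4"]
scores = {
    "geneA": [5.0, 7.0, 1.0, 3.0],
    "geneB": [2.0, 2.0, 2.0, 2.0],
}

# Hypothetical semantic annotation of each sample (e.g. a disease concept).
annotations = {
    "s1": "breast cancer",
    "s2": "breast cancer",
    "s3": "healthy",
    "s4": "healthy",
}


def group_columns(samples, annotations):
    """Map each semantic label to the column indices of its samples."""
    groups = defaultdict(list)
    for idx, sample in enumerate(samples):
        groups[annotations[sample]].append(idx)
    return dict(groups)


def group_means(scores, groups):
    """Average the score of every genomic unit within each semantic group."""
    return {
        unit: {label: mean(row[i] for i in idx) for label, idx in groups.items()}
        for unit, row in scores.items()
    }


groups = group_columns(samples, annotations)
means = group_means(scores, groups)

for unit, per_group in means.items():
    print(unit, per_group)
```

Averaging is only a stand-in for whatever per-group analysis is of interest; the point of the sketch is that the annotations, rather than the raw sample identifiers, decide which columns are compared.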

