Abstract
Public repositories of large-scale omics datasets represent a valuable resource for researchers. Data re-analysis can either answer novel questions or provide critical data that complement in-house experiments. However, despite the development of standards for the compilation of metadata, the identification and organization of samples still constitute a major bottleneck hampering data reuse. We introduce Onassis, an R package within the Bioconductor environment providing key functionalities of Natural Language Processing (NLP) tools. Leveraging biomedical ontologies, Onassis greatly simplifies the association of samples from large-scale repositories with their representation in terms of ontology-based annotations. Moreover, through the use of semantic similarity measures, Onassis hierarchically organizes the datasets of interest, thus supporting the semantically aware analysis of the corresponding omics data. In conclusion, Onassis leverages NLP techniques, biomedical ontologies, and the R statistical framework to identify, relate, and analyze datasets from public repositories. The tool was tested on various large-scale datasets, including compendia of gene expression, histone marks, and DNA methylation, illustrating how it can facilitate the integrative analysis of various omics data.
Highlights
The plummeting cost of high-throughput sequencing experiments has led to a rapid accumulation of omics datasets in public repositories
Biomedical ontologies are used mostly within the computer science domain and, with the exception of the popular Gene Ontology, rarely reach the community of biologists, who would greatly benefit from their support
Through a process known as named entity recognition, Onassis associates free-text descriptions of publicly available samples with ontology concepts, which provide a standard representation for the entities of a given domain of interest
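The idea of matching free-text sample descriptions against ontology terms can be sketched in a few lines of Python. This is only an illustration of dictionary-based named entity recognition, not Onassis code or its API: the concept identifiers (`TOY:0001` and so on), the labels, the synonyms, and the sample descriptions are all hypothetical, and a simple plural-tolerant regular expression stands in for a real NLP pipeline.

```python
import re

# Hypothetical toy dictionary: concept ID -> (preferred label, synonyms).
# These IDs and terms are invented for illustration, not taken from a real ontology.
TOY_ONTOLOGY = {
    "TOY:0001": ("breast cancer", ["breast carcinoma", "mammary tumor"]),
    "TOY:0002": ("hepatocyte", ["liver cell"]),
    "TOY:0003": ("T cell", ["T lymphocyte", "T-cell"]),
}


def build_patterns(ontology):
    """Compile one case-insensitive, plural-tolerant pattern per label or synonym."""
    patterns = []
    for concept_id, (label, synonyms) in ontology.items():
        for term in [label, *synonyms]:
            regex = re.compile(r"\b" + re.escape(term) + r"s?\b", re.IGNORECASE)
            patterns.append((regex, concept_id))
    return patterns


def annotate(text, ontology=TOY_ONTOLOGY):
    """Return the sorted concept IDs whose label or synonym occurs in the text."""
    found = set()
    for regex, concept_id in build_patterns(ontology):
        if regex.search(text):
            found.add(concept_id)
    return sorted(found)


# Hypothetical free-text sample descriptions, as they might appear in metadata.
samples = {
    "sample_A": "ChIP-seq of H3K27ac in primary liver cells",
    "sample_B": "RNA-seq, mammary tumor biopsy, infiltrating T lymphocytes",
}
annotations = {sample_id: annotate(desc) for sample_id, desc in samples.items()}
```

Mapping synonyms to a single concept identifier is what turns heterogeneous descriptions into a standard representation: "liver cells" and "hepatocyte" end up under the same concept.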
Summary
Onassis is available as a package within the R/Bioconductor project [14], a very popular software repository for the analysis of genomic data, used by both bioinformaticians and biologists. Once the semantic information is associated with the samples (based, for example, on the annotation of sample metadata with cell lines and disease conditions), Onassis uses it within the compare function to direct the analysis of the actual omics data (Fig. 1). This requires that the omics data be stored in a score matrix, whose rows represent genomic units and whose columns represent samples. The following use cases will illustrate these analyses in detail.
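As a minimal illustration of the score-matrix layout described above, the Python sketch below groups sample columns by their semantic annotation and compares per-group means for each genomic unit. It is not the Onassis compare function and makes no claim about how that function works: the gene names, sample identifiers, scores, annotation labels, and helper functions are all hypothetical, and a difference of means stands in for whatever statistic an actual analysis would use.

```python
from statistics import mean

# Hypothetical score matrix: rows are genomic units, columns are samples.
sample_ids = ["s1", "s2", "s3", "s4"]
score_matrix = {
    "gene_A": [1.0, 3.0, 8.0, 10.0],
    "gene_B": [5.0, 5.0, 4.0, 6.0],
}

# Hypothetical semantic annotation assigned to each sample.
sample_annotation = {
    "s1": "healthy",
    "s2": "healthy",
    "s3": "disease",
    "s4": "disease",
}


def columns_by_annotation(ids, annotation):
    """Map each annotation label to the column indices of its samples."""
    groups = {}
    for idx, sample_id in enumerate(ids):
        groups.setdefault(annotation[sample_id], []).append(idx)
    return groups


def group_means(matrix, groups):
    """Compute, for every genomic unit, the mean score within each group."""
    return {
        unit: {label: mean(scores[i] for i in idxs) for label, idxs in groups.items()}
        for unit, scores in matrix.items()
    }


groups = columns_by_annotation(sample_ids, sample_annotation)
means = group_means(score_matrix, groups)

# Difference of group means per genomic unit (disease minus healthy).
differences = {unit: m["disease"] - m["healthy"] for unit, m in means.items()}
```

Keeping the annotation separate from the matrix means the same scores can be regrouped under a different semantic class (for example cell line instead of disease condition) without touching the data.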