Abstract

Background. Since the publication of the first journal in 1662, “Philosophical Transactions of the Royal Society of London”, scientific publications have been used to communicate scholarly work, including hypotheses, methods, experiments and results. Despite the availability of electronic formats and advances in information retrieval supported by public repositories such as PubMed, scientific publications remain poorly connected to each other as well as to external resources. In fact, most of the information remains locked up in discrete documents, which makes it difficult to integrate into automatic processes and workflows. With the continuous growth of scientific publications (more than 1.2 million articles were published in PubMed during 2016), benefitting from the scientific literature without a machine-processable infrastructure poses a major challenge to researchers. Finding relevant publications for a particular research topic is one of the areas where machine-processable content would make a difference. Although some repositories such as PubMed or Elsevier offer a list of recommended publications (i.e., publications related by content), they provide neither a similarity score nor the terms participating in the relation, making it difficult to understand how recommended articles relate to each other. The Linked Open Data initiative together with semantic technologies provides a connectivity tissue that has not yet been fully used to support the generation of self-describing, semantic and machine-processable documents. The availability of linked data on top of the digital form currently adopted by scientific publications should facilitate knowledge retrieval, making it possible to find relations and facts that are otherwise hidden or difficult to grasp. Furthermore, it should facilitate approaches that work on full text rather than just title and abstract.

Results. Here we present Biotea, our approach to semantically generate self-describing, machine-processable scholarly documents. We first define a Resource Description Framework (RDF) model to integrate metadata and content from scientific publications into the Linked Open Data cloud. We enrich this infrastructure with a semantic annotation process; that is, we extract terms and expressions from the documents and connect them to ontological concepts. Our RDF model makes extensive use of existing ontologies and semantic enrichment services. We have applied our model to the full-text, open-access subset of PubMed Central. Biolinks is built on top of Biotea. For Biolinks, we first propose a reclassification of the Unified Medical Language System (UMLS) semantic groups. This reclassification is then used to semantically characterize documents as well as relations between scientific publications. A semantic model is defined for both the characterization of the similarity and the processes required to apply the Biolinks principles to any publication that follows the Journal Article Tag Suite format or the RDF model defined by Biotea. Biolinks has been applied to a subset of documents in the TREC-05 Genomics Track collection, which have been annotated with UMLS concepts. On top of these annotated documents, we have added a distribution score according to semantic profiles. Our models and processes are open-access and publicly available on GitHub (see https://github.com/biotea and https://github.com/ljgarcia/biotea-biolinks). The data produced by applying Biotea to PubMed Central Open Access is also public (see http://doi.org/10.5281/zenodo.376814), as is the data generated by applying Biolinks to the TREC-05 Genomics Track collection (see http://doi.org/10.5281/zenodo.290371).

Conclusions. The semantic processing of the biomedical literature supported by Biotea makes it possible to integrate scholarly communications into the Linked Open Data cloud. Biotea also delivers a flexible and adaptable set of tools for metadata enrichment and semantic processing of scientific publications. In this way, Biotea provides a semantic-based scaffolding that should make it easier to benefit from the myriad of documents currently published. Biolinks is an example of the benefits that Biotea opens up. With the semantic characterization and similarity scores, Biolinks provides tools that make it easier for researchers to understand the general subject of a publication as well as how it relates to other publications. The weighting and similarity processes can be narrowed to a subset of the semantic groups, enabling researchers to focus on what is most relevant to them. Biolinks also contributes to understanding the differences between working with only title and abstract and working with full text. To sum up, Biotea together with Biolinks contributes to enabling literature-based knowledge discovery from a semantic perspective.
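The idea of a similarity score that exposes the participating terms and can be narrowed to selected semantic groups can be illustrated with a minimal Python sketch. This is not the Biolinks implementation: the abstract does not specify the scoring function, so cosine similarity over concept counts is an assumption, and the function names (`profile`, `cosine_similarity`, `shared_concepts`), the concept identifiers and the group labels are hypothetical placeholders.

```python
import math
from collections import Counter

def profile(annotations, groups=None):
    """Count concept occurrences in one document.

    `annotations` is a list of (concept_id, semantic_group) pairs.
    If `groups` is given, only concepts in those semantic groups are kept.
    """
    return Counter(
        concept
        for concept, group in annotations
        if groups is None or group in groups
    )

def cosine_similarity(a, b):
    """Cosine similarity between two concept-count profiles (assumed metric)."""
    dot = sum(a[c] * b[c] for c in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile shares nothing with any other profile
    return dot / (norm_a * norm_b)

def shared_concepts(a, b):
    """Concepts that participate in the relation between two documents."""
    return sorted(a.keys() & b.keys())

# Hypothetical annotated documents; identifiers and groups are placeholders.
doc_a = [("concept-1", "DISO"), ("concept-1", "DISO"), ("concept-2", "CHEM")]
doc_b = [("concept-1", "DISO"), ("concept-3", "CHEM")]

overall = cosine_similarity(profile(doc_a), profile(doc_b))
disorders_only = cosine_similarity(
    profile(doc_a, groups={"DISO"}), profile(doc_b, groups={"DISO"})
)
terms = shared_concepts(profile(doc_a), profile(doc_b))
```

Restricting both profiles to one group changes the score because only the concepts in that group are compared, which mirrors the narrowing to a subset of semantic groups described above.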
