SPRENO: a BioC module for identifying organism terms in figure captions.

Hong-Jie Dai, Onkar Singh

doi:10.1093/database/bay048

Abstract

Recent advances in biological research reveal that most experiments strive for comprehensive exploration of the biological system rather than targeting specific biological entities. The qualitative and quantitative findings of these investigations are often available only as figures in published papers. Such findings have been instrumental in deepening the understanding of biological processes and pathways. However, these data remain largely inaccessible to machines because figure captions carry abundant information that is often expressed ambiguously. The abbreviation ‘SIN’ exemplifies this issue, as it may stand for Sindbis virus or the sex-lethal interactor gene (Drosophila melanogaster). To overcome this ambiguity, entities should be identified by linking them to their respective entries in established biological databases. Among all entity types, species identification plays a pivotal role in disambiguating related entities in the text. In this study, we present our species identification tool SPRENO (Species Recognition and Normalization), which recognizes organism terms mentioned in figure captions and links them to the NCBI taxonomy database by exploiting contextual information from both the figure caption and the corresponding full text. To determine the ID of ambiguous organism mentions, two disambiguation methods have been developed. One is based on the majority rule, which selects the ID that has been successfully linked to previously mentioned organism terms. The other is a convolutional neural network (CNN) model trained by learning both the context and the distance information of the target organism mention. As a system based on the majority rule, SPRENO was one of the top-ranked systems in the BioCreative VI BioID track and achieved micro F-scores of 0.776 (entity recognition) and 0.755 (entity normalization) on the official test set.
Additionally, SPRENO-CNN achieved higher precision but lower recall and F-scores (0.720/0.711 for entity recognition/normalization). SPRENO is freely available at https://bigodatamining.github.io/software/201801/.

Database URL: https://bigodatamining.github.io/software/201801/
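As an illustration of the majority-rule idea described in the abstract, the following is a minimal sketch and not the SPRENO implementation. The function name `resolve_by_majority`, its arguments, and the tie-breaking behaviour are assumptions made for this example; the only grounded element is the principle of preferring the ID already linked to earlier organism mentions. The example uses the ambiguous term "yeast", which could refer to NCBI Taxonomy ID 4932 (Saccharomyces cerevisiae) or 4896 (Schizosaccharomyces pombe).

```python
from collections import Counter
from typing import Optional, Sequence


def resolve_by_majority(
    candidate_ids: Sequence[str],
    previously_linked_ids: Sequence[str],
) -> Optional[str]:
    """Pick the candidate ID most often linked to earlier organism mentions.

    candidate_ids: taxonomy IDs the ambiguous mention could map to
        (hypothetical input, e.g. from a dictionary lookup).
    previously_linked_ids: IDs already assigned to unambiguous mentions
        seen earlier in the caption or full text.
    Returns None when no candidate has been seen before.
    """
    # Count only those earlier links that are also candidates for this mention.
    counts = Counter(
        tid for tid in previously_linked_ids if tid in candidate_ids
    )
    if not counts:
        return None
    # Assumed tie-break: the candidate listed first wins on equal counts.
    return max(candidate_ids, key=lambda tid: counts[tid])


# "yeast" is ambiguous between 4932 and 4896; earlier mentions in the
# article were linked to 4896 twice, 4932 once, and 9606 (human) once.
chosen = resolve_by_majority(
    ["4932", "4896"],
    ["4896", "9606", "4896", "4932"],
)
print(chosen)
```

When no candidate has been linked before, the sketch returns `None`, leaving the decision to some other strategy; how the actual system handles that case is not stated here.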

Highlights

To facilitate a better understanding of the fundamental life processes, biological research nowadays tends to explore and perceive the biological system in its entirety rather than focusing on specific biological entities.
We developed new algorithms optimized for normalizing organism terms mentioned in figure captions by considering both the resident text and the corresponding full text.
Motivated by the ambiguous nature of organism terms in figure captions, we extended the multistage organism identification algorithm developed in our previous work [5] to capitalize on the information collected from the full text to resolve the ambiguities.

Summary

Introduction

To facilitate a better understanding of the fundamental life processes, biological research nowadays tends to explore and perceive the biological system in its entirety rather than focusing on specific biological entities. Liechti et al. [1] recently announced the SourceData platform initiative, which can link related figures across papers to form a searchable knowledge graph. Building such a resource requires life-science expertise and manual effort from bio-curators, who identify biomedical entities in the text and link them to their corresponding database entries. This process is labor-intensive but indispensable for ensuring data quality. It is therefore essential to develop new methods and tools that reduce the time and effort bio-curators spend on recognizing entities in figure captions and associating them with their corresponding database IDs [2].

Methods

Results

Discussion

Conclusion