EXTRACT 2.0: interactive identification of biological entities mentioned in text to assist database curation and knowledge extraction

Evangelos Pafilis, Lars Jensen, Rūdolfs Bērzinš, Christos Arvanitidis

doi:10.3897/tdwgproceedings.1.20152

Abstract

Data curation occurs in many aspects of biodiversity research, e.g. in the digitization of specimen collections and the extraction of species occurrences from legacy literature. It is typically time-consuming and tedious. Gathering information on species and exposing it via search interfaces could be facilitated once phrases of interest have been recognized and the mentioned entities have been linked to community resources. A curator can benefit from interactive systems that highlight biological entities in a document, indicate sections of interest, map entities to the corresponding database records or ontology terms, and offer an easy mechanism for extracting annotations in a structured form.

EXTRACT (https://extract.hcmr.gr, Pafilis et al. 2016) is a system that aims to address these challenges. Its web user interface is a bookmarklet that identifies genes/proteins, chemical compounds, organisms, environments, tissues, diseases, phenotypes and Gene Ontology terms mentioned in a web page and maps them to their corresponding database, ontology, and taxonomy entries. Two modes of operation are supported: (a) extraction of biological entities mentioned in a user-selected piece of text, and (b) full-page tagging. To make it easy to collect the extracted annotations, e.g. for use in an Excel spreadsheet, direct Copy to clipboard and Save to file (tab-delimited) are supported.

EXTRACT was originally developed specifically to facilitate metagenomic sample record annotation (Pafilis et al. 2016). As such, it participated in the BioCreative V interactive annotation task, where it achieved one of the top scores in terms of usability and was estimated to accelerate curation by 15–25% (Wang et al. 2016).

The latest version of EXTRACT (2.0, Pafilis et al. 2017) serves a much broader audience of both biomedicine and biodiversity researchers, and thus recognizes a wide range of entity types from many community resources: organisms (NCBI Taxonomy, https://www.ncbi.nlm.nih.gov/taxonomy); environments (Environment Ontology, Buttigieg et al. 2016); diseases and phenotypes (Disease Ontology, Kibbe et al. 2014, and Mammalian Phenotype Ontology, Smith and Eppig 2012); tissues and cell lines (BRENDA Tissue Ontology, Placzek et al. 2016); biological processes, molecular functions, and cellular components (Gene Ontology, Gene Ontology Consortium 2014); protein-coding and non-coding RNA (ncRNA) genes from more than 2000 organisms (STRING, Szklarczyk et al. 2017, and RAIN, Junge et al. 2017); and small molecule compounds (STITCH, Szklarczyk et al. 2015).

In addition to curators benefitting from such a tool, knowledge-base developers can easily integrate the EXTRACT functionality into their own systems. To this end, we provide a robust and thoroughly documented Application Programming Interface (https://extract.hcmr.gr, FAQ section). EXTRACT can thus serve as a building block in large knowledge management pipelines that also perform downstream tasks such as statistical entity association and association extraction, generation of knowledge graphs presenting the extracted associations, document indexing, and information retrieval. Such tasks lie at the core of the workshop to which this abstract has been submitted and are pertinent to the TDWG 2017 theme, which is dedicated to the integration of species occurrence, gene, phenotype, and environment associations.
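
As an illustration of how the tab-delimited export mentioned above might be consumed in a downstream pipeline, the following Python sketch groups saved annotations by entity type. The three-column layout (matched text, entity type, identifier), the function name `group_annotations`, and the placeholder identifiers are assumptions made for this example only; they are not the documented EXTRACT file format, which should be checked against a real export before adapting the code.

```python
import csv
import io
from collections import defaultdict

# Hypothetical sample of a tab-delimited export. The column layout
# (matched text, entity type, identifier) and the EXAMPLE:n identifiers
# are placeholders for illustration, not real EXTRACT output.
SAMPLE_TSV = (
    "Posidonia oceanica\torganism\tEXAMPLE:1\n"
    "seagrass meadow\tenvironment\tEXAMPLE:2\n"
    "Posidonia oceanica\torganism\tEXAMPLE:1\n"
    "\n"
)


def group_annotations(tsv_text):
    """Group (text, identifier) pairs by entity type, dropping exact duplicates."""
    grouped = defaultdict(list)
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    for row in reader:
        if len(row) != 3:
            continue  # skip blank or malformed lines
        text, entity_type, identifier = (field.strip() for field in row)
        pair = (text, identifier)
        if pair not in grouped[entity_type]:
            grouped[entity_type].append(pair)
    return dict(grouped)


if __name__ == "__main__":
    for entity_type, pairs in group_annotations(SAMPLE_TSV).items():
        for text, identifier in pairs:
            print(f"{entity_type}\t{text}\t{identifier}")
```

Grouping by entity type is only one possible first step; the same parsed rows could equally feed a spreadsheet, an index, or an association-extraction stage.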

Full Text