RENET2: high-performance full-text gene-disease relation extraction with iterative training data expansion.

Junhao Su, Ruibang Luo, Hing-Fung Ting, Ye Wu, Tak-Wah Lam

doi:10.1093/nargab/lqab062


Open Access



Journal: NAR Genomics and Bioinformatics
Publication Date: Jun 23, 2021
Citations: 10
License type: CC BY 4.0

Affiliation: University of Hong Kong

Abstract

Relation extraction (RE) is a fundamental task for extracting gene–disease associations from biomedical text. Many state-of-the-art tools are limited in scope, as they can extract gene–disease associations only from single sentences or abstracts. A few studies have explored extraction from full-text articles, but there remains substantial room for improvement. In this work, we propose RENET2, a deep learning-based RE method that implements Section Filtering and ambiguous relations modeling to extract gene–disease associations from full-text articles. To address the scarcity of labels for full-text articles, we designed a novel iterative training data expansion strategy to build an annotated full-text dataset. In our experiments, RENET2 achieved an F1-score of 72.13% for extracting gene–disease associations from an annotated full-text dataset, which was 27.22, 30.30, 29.24 and 23.87% higher than BeFree, DTMiner, BioBERT and RENET, respectively. We applied RENET2 to (i) ∼1.89M full-text articles from PubMed Central, finding ∼3.72M gene–disease associations; and (ii) the LitCovid articles, ranking the top 15 proteins associated with COVID-19, supported by recent articles. RENET2 is an efficient and accurate method for full-text gene–disease association extraction. The source code, manually curated abstract/full-text training data, and results of RENET2 are available on GitHub.
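For readers unfamiliar with the metric, the F1-score reported above is the harmonic mean of precision and recall. The following is a minimal sketch of how such a score could be computed over sets of extracted gene–disease pairs. It is not taken from the RENET2 code base: the function name `pair_f1` and the example pairs are hypothetical and chosen only for illustration, and the paper's own evaluation protocol may differ.

```python
def pair_f1(predicted, gold):
    """F1-score over (gene, disease) pairs; an illustrative helper, not RENET2's API."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)  # pairs that are both predicted and annotated
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)


# Hypothetical annotations and predictions, for illustration only
gold = {
    ("BRCA1", "breast cancer"),
    ("CFTR", "cystic fibrosis"),
    ("HTT", "Huntington disease"),
}
predicted = {
    ("BRCA1", "breast cancer"),
    ("CFTR", "cystic fibrosis"),
    ("TP53", "asthma"),
}

score = pair_f1(predicted, gold)
print(f"F1 = {score:.4f}")
```

In this toy case two of the three predicted pairs are correct and two of the three annotated pairs are recovered, so precision and recall are equal and the F1-score matches them.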

Full Text