Assisting document triage for human kinome curation via machine learning.

Yi-Yu Hsu, Zhiyong Lu, Chih-Hsuan Wei

doi:10.1093/database/bay091

Abstract

In the era of data explosion, the increasing rate of publication makes it challenging to meet the specific curation requirements of bio-literature databases. Recognizing these demands, we designed a document triage system that uses automatic methods to retrieve the most relevant articles more efficiently in curation workflows and to reduce the workload of biocurators. Since BioCreative VI (2017), we have incorporated text mining into our system to curate articles related to human kinase proteins more effectively. We tested several machine learning methods together with state-of-the-art concept extraction tools. For features, we extracted rich co-occurrence and linguistic information to model the curation process for human kinome articles in the neXtProt database. As shown in the official evaluation of the human kinome curation task in BioCreative VI, our system can effectively retrieve 5.2 and 6.5 kinase articles with the relevant disease (DIS) and biological process (BP) information, respectively, among the top 100 returned results. Compared with neXtA5, our system demonstrates significant improvements in prioritizing kinome-related articles. On the DIS axis, our system achieves a mean average precision (MAP) of 0.458 and a maximum precision observed of 0.109, whereas the best values reported for neXtA5 are 0.41 and 0.04, respectively. Our system also outperforms neXtA5 on the BP axis, with a MAP of 0.195 against a reported value of 0.11 for neXtA5. These results suggest that our system may be able to assist neXtProt biocurators in practice.
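The abstract reports ranking quality as mean average precision (MAP) and as the number of relevant articles among the top 100 results. The sketch below illustrates how such metrics are commonly computed from a ranked list of relevance labels. It is not the official BioCreative VI evaluation script: the function names are hypothetical, and average precision is normalized here by the number of relevant documents found in the ranked list, which assumes that every relevant document was retrieved.

```python
from typing import Sequence


def average_precision(ranked_relevance: Sequence[bool]) -> float:
    """Average precision for one query.

    ranked_relevance[i] is True when the document at rank i + 1 is relevant.
    Assumption: all relevant documents appear somewhere in the ranked list.
    """
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(ranked_relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(queries: Sequence[Sequence[bool]]) -> float:
    """Mean of the per-query average precision values."""
    if not queries:
        return 0.0
    return sum(average_precision(q) for q in queries) / len(queries)


def relevant_in_top_k(ranked_relevance: Sequence[bool], k: int) -> int:
    """Number of relevant documents among the first k results."""
    return sum(1 for is_relevant in ranked_relevance[:k] if is_relevant)


def precision_at_k(ranked_relevance: Sequence[bool], k: int) -> float:
    """Fraction of the first k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return relevant_in_top_k(ranked_relevance, k) / k
```

For example, under these assumptions a ranking whose first and third results are relevant and whose second is not has an average precision of (1/1 + 2/3) / 2. In a per-kinase evaluation, each kinase query would contribute one such ranked list, and MAP would average over the queries.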

Highlights

Document triage typically refers to the process of scanning all query-related papers and finding relevant ones for further curation.
To name a few of the leading research efforts, Kim et al. [5] used a machine learning (ML) approach to triage Comparative Toxicogenomics Database (CTD)-relevant articles, based on their prior system for the protein–protein interaction article classification task in BioCreative III.
We propose an ML approach to identify articles whose abstracts describe a specific kinase and its relation to diseases (DIS) or biological processes (BP).

Summary

Introduction

Document triage typically refers to the process of scanning all query-related papers and finding relevant ones for further curation. Given the ever-growing biomedical literature and the high cost of manual curation, there is an increasing need to leverage automatic text-mining methods to identify and prioritize documents for manual curation. For this purpose, Critical Assessment of Information Extraction Systems in Biology (BioCreative) has recently organized several document triage challenge tasks for protein–protein interaction and Comparative Toxicogenomics Database (CTD) curation [1, 2]. These efforts have led to several successful integrations and deployments of text mining systems in production curation pipelines, such as the use of PubTator and eGenPub in UniProt protein curation [3, 4]. We find that our system can effectively reduce the workload of biocurators and improve productivity.

Methods

Result

Discussion and conclusion