Abstract

The text-mining services for kinome curation track, part of BioCreative VI, proposed a competition to assess the effectiveness of text mining for literature triage. The track exploited an unpublished curated data set from the neXtProt database, which contained comprehensive annotations for 300 human protein kinases. For a given protein and a given curation axis [diseases or gene ontology (GO) biological processes], participants’ systems had to identify and rank relevant articles in a collection of 5.2 M MEDLINE citations (task 1) or 530 000 full-text articles (task 2). Explored strategies comprised named-entity recognition and machine-learning frameworks. For the latter approach, participants developed methods to derive a set of negative instances, as the databases typically do not store articles that curators judged irrelevant. The supervised approaches proposed by the participating groups achieved significant improvements compared to the baseline established in a previous study and compared to a basic PubMed search.
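The negative-instance problem mentioned above can be illustrated with a minimal sketch. The abstract does not say how participants derived their negatives, so the strategy here is only an assumption: articles retrieved for a kinase but absent from its curated set are treated as pseudo-negatives and sampled at a fixed ratio to the positives. The function name `derive_negatives`, the `ratio` parameter and the PMIDs are all hypothetical.

```python
import random


def derive_negatives(positives, candidates, ratio=1, seed=0):
    """Sample pseudo-negative PMIDs for one kinase and one curation axis.

    Assumption (not taken from the track description): any candidate
    article that curators did not annotate is treated as irrelevant.
    """
    # Candidates that are not curated positives, sorted for reproducibility.
    unlabeled = sorted(set(candidates) - set(positives))
    # Never ask for more negatives than there are unlabeled candidates.
    k = min(len(unlabeled), ratio * len(positives))
    rng = random.Random(seed)
    return rng.sample(unlabeled, k)


# Toy data with made-up PMIDs.
curated = {"111", "222"}                        # curated positives
retrieved = ["111", "222", "333", "444", "555"]  # articles mentioning the kinase

negatives = derive_negatives(curated, retrieved, ratio=1)

# Labeled training set: 1 = relevant, 0 = pseudo-negative.
training_set = [(pmid, 1) for pmid in sorted(curated)]
training_set += [(pmid, 0) for pmid in negatives]
```

Pseudo-negatives of this kind are noisy, because an unannotated article may still be relevant; the sampling ratio is one knob for limiting how much that noise weighs on a supervised ranker.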

Highlights

  • Biomedical big data offers tremendous potential for making discoveries and demands unprecedented efforts to keep structured databases up to date with the findings described in the torrent of publications [1]

  • Beyond simple ad hoc information retrieval, teams performed information extraction of biological entities in order to compute the relevance of documents for triage

  • Participants first describe the strategies they used for their systems


Summary

Introduction and motivation

Biomedical big data offers tremendous potential for making discoveries and demands unprecedented efforts to keep structured databases up to date with the findings described in the torrent of publications [1]. In 2017, the BioCreative VI Kinome Track proposed a competition in literature triage based on the unpublished neXtProt protein kinase data set. The data set contains comprehensive annotations about kinase substrates, GO biological processes and diseases. It covers a significant fraction of the human kinome: 300 proteins out of ∼500 human kinases. Each annotation is supported by a reference to a publication, given as a PubMed identifier (PMID). This data set will be integrated into the neXtProt database in 2018, but it was still unpublished at the time of the competition.

