Abstract

Background: The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. Due to the richness of these annotations, the corpus provides a good testbed for evaluation of biomedical literature information extraction systems.

Methods: In this paper, we focus on assessing performance on extracting the relations in the corpus, using gold standard entities as a starting point, to establish a baseline for extraction of relations important for extraction of genetic variant information from the literature. We test the application of the Public Knowledge Discovery Engine for Java (PKDE4J) system, a natural language processing system designed for information extraction of entities and relations in text, on the relation extraction task using this corpus.

Results: For the relations which are attested at least 100 times in the Variome corpus, we realise a performance ranging from 0.78–0.84 Precision-weighted F-score, depending on the relation. We find that the PKDE4J system adapted straightforwardly to the range of relation types represented in the corpus; some extensions to the original methodology were required to adapt to the multi-relational classification context. The results are competitive with state-of-the-art relation extraction performance on more heavily studied corpora, although the analysis shows that the Recall of a co-occurrence baseline outweighs the benefit of improved Precision for many relations, indicating the value of simple semantic constraints on relations.

Conclusions: This work represents the first attempt to apply relation extraction methods to the Variome corpus. The results demonstrate that automated methods have good potential to structure the information expressed in the published literature related to genetic variants, connecting mutations to genes, diseases, and patient cohorts. Further development of such approaches will facilitate more efficient biocuration of genetic variant information into structured databases, leveraging the knowledge embedded in the vast publication literature.
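
The co-occurrence baseline and the Precision-weighted F-score mentioned in the Results can be illustrated with a minimal sketch. Everything below is an assumption made for illustration, not the paper's implementation: the entity type names, the `ALLOWED` type-pair constraints, the function names, and the choice of beta = 0.5 as the Precision weighting are all hypothetical. The sketch proposes every ordered entity pair in a sentence as a relation, optionally filters the pairs by a simple semantic type constraint, and scores the result against gold pairs.

```python
from itertools import permutations

# Hypothetical semantic constraints: (argument 1 type, argument 2 type) pairs
# that are permitted to form a relation. Not taken from the Variome schema.
ALLOWED = {("mutation", "gene"), ("mutation", "disease"), ("cohort", "disease")}


def cooccurrence_candidates(entities, allowed=None):
    """Propose every ordered pair of co-occurring entities as related.

    entities: list of (entity_id, entity_type) tuples from one sentence.
    allowed:  optional set of permitted (type1, type2) pairs; None means
              no type constraint (the plain co-occurrence baseline).
    """
    pairs = set()
    for (id1, type1), (id2, type2) in permutations(entities, 2):
        if allowed is None or (type1, type2) in allowed:
            pairs.add((id1, id2))
    return pairs


def f_beta(predicted, gold, beta=0.5):
    """Return (precision, recall, F-beta); beta < 1 weights Precision more."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    score = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, score


# Toy sentence with three gold-standard entities and two gold relations.
sentence = [("T1", "mutation"), ("T2", "gene"), ("T3", "disease")]
gold = {("T1", "T2"), ("T1", "T3")}

unconstrained = cooccurrence_candidates(sentence)
constrained = cooccurrence_candidates(sentence, ALLOWED)

print(f_beta(unconstrained, gold))  # full Recall, low Precision
print(f_beta(constrained, gold))    # type constraints remove spurious pairs
```

In this toy case the unconstrained baseline recovers every gold relation but proposes four spurious pairs, while the type-constrained variant keeps full Recall and removes the spurious pairs, which is the kind of effect the Results attribute to simple semantic constraints.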


Summary

Introduction

The promise of precision medicine is that individual variation at the genomic level can provide important insights into the detailed disease status of a patient, and guide the selection of the best choice of treatment for that individual. The Variome corpus, a small collection of published articles about inherited colorectal cancer, includes annotations of 11 entity types and 13 relation types related to the curation of the relationship between genetic variation and disease. The Variome corpus is differentiated from these corpora in that it includes relations connecting patient cohorts to mutations and diseases, allowing for detailed extraction of the characteristics of specific subgroups of patients described in the literature. Both of these new corpora focus on extracted associations rather than text-bound relations; they do not adopt a standard representation for corpus annotations (e.g., BioC [17] or brat [18] format), and it is difficult to identify specific annotations of relations that are tied to specific text spans. We further note that the BRONCO corpus has not been used to test relation extraction methods.
