Automated detection of discourse segment and experimental types from the text of cancer pathway results sections.

Gully A.P.C Burns, Anita De Waard, Pradeep Dasigi, Eduard H Hovy

doi:10.1093/database/baw122

Abstract

Automated machine-reading biocuration systems typically use sentence-by-sentence information extraction to construct meaning representations for use by curators. This approach does not directly reflect the discourse structure that scientists use to construct an argument from the experimental data available within an article, and is therefore less likely to correspond to the representations typically used in biomedical informatics systems (let alone to the mental models that scientists have). In this study, we develop Natural Language Processing methods to locate, extract, and classify the individual passages of text from articles’ Results sections that refer to experimental data. In our domain of interest (molecular biology studies of cancer signal transduction pathways), individual articles may contain as many as 30 small-scale individual experiments describing a variety of findings, upon which authors base their overall research conclusions. Our system automatically classifies discourse segments in these texts into seven categories (fact, hypothesis, problem, goal, method, result, implication) with an F-score of 0.68. These segments describe the essential building blocks of scientific discourse to (i) provide context for each experiment, (ii) report experimental details and (iii) explain the data’s meaning in context. We evaluate our system on text passages from articles that were curated in molecular biology databases (the Pathway Logic Datum repository, the Molecular Interaction MINT and INTACT databases) linking individual experiments in articles to the type of assay used (coprecipitation, phosphorylation, translocation, etc.). We use supervised machine learning techniques on text passages containing unambiguous references to experiments to obtain baseline F1 scores of 0.59 for MINT, 0.71 for INTACT and 0.63 for Pathway Logic.
Although preliminary, these results support the notion that targeting information extraction methods to experimental results could provide accurate, automated methods for biocuration. We also suggest the need for finer-grained curation of experimental methods used when constructing molecular biology databases.

Highlights

We suggest the need for finer-grained curation of experimental methods used when constructing molecular biology databases.
We used a training set of 258 passages and evaluated performance using mean accuracy and weighted F1 measures from 5-fold cross-validation with 24 different target classes. We repeated this process for articles from the MINT database: of the 359 passages of experimental text that we found referring to figures, 136 were discarded because they referred to multiple types of experiments.
A key element of this work is the linkage between individual subfigures and their underlying experiments. This linkage arises from core design decisions in both the Pathway Logic (PL) and MINT data sets, and we performed a small-scale manual evaluation as a part of this study.
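
The evaluation protocol named in the highlights (mean accuracy and weighted F1 over 5-fold cross-validation) can be illustrated with a minimal sketch. The page does not say how the authors implemented it, so everything here is an assumption for illustration: the function names `weighted_f1`, `accuracy` and `k_fold_indices` are hypothetical, the folds are simple contiguous splits, and the labels in the usage note are toy data rather than results from the paper.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of items whose predicted label matches the gold label."""
    return sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)


def weighted_f1(gold, pred):
    """Per-class F1 averaged with weights proportional to class support."""
    support = Counter(gold)  # number of gold items per class
    total = len(gold)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
        fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
        fn = n - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += (n / total) * f1
    return score


def k_fold_indices(n_items, k=5):
    """Yield (train, test) index lists for k contiguous, near-equal folds."""
    sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size
```

With toy labels, `weighted_f1(["a", "a", "b"], ["a", "b", "b"])` weights the F1 of class `a` by 2/3 and the F1 of class `b` by 1/3. Splitting 258 items with `k_fold_indices(258, 5)` gives five disjoint test folds that together cover every item once; in practice the passages would be shuffled (and possibly stratified by class) before splitting.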

Summary

Introduction

High-level descriptive languages such as the ‘Systems-Biology Markup Language’ [2], BioPax [3] or the ‘Biological Expression Language’ [4] provide a semantic representation of pathways, reactions and reactants (with encodings for additional information such as genetic details, post-translational modifications, reaction kinetics, etc.). These languages provide interpretable summaries of pathway mechanisms that can be read by humans and/or reasoned about by computational knowledge representation and reasoning methods. At a deeper level of description, executable languages such as Pathway Logic (PL) [6], Kappa [7], and PySB [8] provide simulation/reasoning frameworks that can make theoretical predictions about aspects of the state of the system under different hypothesized conditions [9]. These types of formulations act as the target for reading systems and for the practice of biocuration generally.



Journal: Database: the journal of biological databases and curation
Publication Date: Jan 1, 2016
Citations: 21
License type: cc-by
