Abstract

Scientific theories and models in Earth science typically involve changing variables and their complex interactions, including correlations, causal relations and chains of positive/negative feedback loops. Variables tend to be complex rather than atomic entities and are expressed as noun phrases containing multiple modifiers, e.g. "oxygen depletion in the upper 500 m of the ocean" or "timing and magnitude of surface temperature evolution in the Southern Hemisphere in deglacial proxy records". Text mining from Earth science literature is therefore significantly different from biomedical text mining and requires different approaches and methods. Our approach aims at automatically locating and extracting variables and their direction of variation: increasing, decreasing or just changing. Variables are initially extracted by matching tree patterns onto the syntax trees of the source texts. Next, variables are generalised in order to enhance their similarity, facilitating hierarchical search and inference. This generalisation is accomplished by progressive pruning of syntax trees using a set of tree transformation operations. Text mining results are presented as a browsable variable hierarchy which allows users to inspect all mentions of a particular variable type in the text as well as any generalisations or specialisations. The approach is demonstrated on a corpus of 10k abstracts of Nature publications in the field of marine science. We discuss experiences with this early prototype and outline a number of possible improvements and directions for future research.

Highlights

  • As a partial solution to the low frequency of long, complex variable expressions, we propose progressive pruning of syntax trees using a set of tree transformation operations

  • We have argued that the paradigm established in biomedical text mining does not transfer directly to other scientific domains like Earth science

  • A new approach was proposed for extracting variables and their direction of variation, focusing on events rather than entities

Summary

Introduction

Text mining of scientific literature originates from efforts to cope with the ever growing flood of publications in biomedicine (Swanson, 1986; Swanson, 1988; Swanson and Smalheiser, 1997; Hearst, 1999; Ananiadou et al., 2006; Zweigenbaum et al., 2007; Cohen and Hersh, 2005; Krallinger et al., 2008; Rodriguez-Esteban, 2009; Zweigenbaum and Demner-Fushman, 2009; Ananiadou et al., 2010; Simpson and Demner-Fushman, 2012; Ananiadou et al., 2014). We found that due to significant differences between the conceptual frameworks of biomedicine and marine science, "porting" the biomedical text mining infrastructure to another domain will not suffice. Defining the entities of interest in marine science turns out to be much harder. Not only does the set of entities seem to be more open-ended in nature, the entities themselves also tend to be complex and expressed as noun phrases containing multiple modifiers, giving rise to examples like "oxygen depletion in the upper 500 m of the ocean" or "timing and magnitude of surface temperature evolution in the Southern Hemisphere in deglacial proxy records". Since many of these changing variables are long and complex expressions, their frequency of occurrence tends to be low, making the discovery of relations among different variables harder. Text mining results are presented as a browsable variable hierarchy which allows users to inspect all mentions of a particular variable type in the text as well as any generalisations or specialisations.
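
To make the idea of generalising variables by progressive pruning more concrete, here is a minimal Python sketch. It is not the authors' implementation: a real system would operate on full syntax trees, whereas this toy version assumes a variable has already been reduced to a head noun with flat lists of pre- and post-modifiers. The names `Variable`, `prune_once`, `generalisations` and `build_hierarchy`, the pruning order, and the second example mention ("in coastal waters") are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Variable:
    """Toy stand-in for a syntax tree: a head noun plus flat modifier lists."""
    head: str
    premods: tuple = ()   # e.g. ("oxygen",)
    postmods: tuple = ()  # e.g. ("in the upper 500 m", "of the ocean")

    def text(self):
        return " ".join([*self.premods, self.head, *self.postmods])


def prune_once(var):
    """Apply one assumed pruning step, or return None if nothing is left to prune."""
    if var.postmods:   # drop the last post-modifier first
        return Variable(var.head, var.premods, var.postmods[:-1])
    if var.premods:    # then drop the leftmost pre-modifier
        return Variable(var.head, var.premods[1:], ())
    return None


def generalisations(var):
    """Return the chain from the original variable down to its bare head."""
    chain = [var]
    while (nxt := prune_once(chain[-1])) is not None:
        chain.append(nxt)
    return chain


def build_hierarchy(variables):
    """Map each generalised form to the set of original mentions it covers."""
    index = defaultdict(set)
    for var in variables:
        for general in generalisations(var):
            index[general.text()].add(var.text())
    return index


mentions = [
    Variable("depletion", ("oxygen",), ("in the upper 500 m", "of the ocean")),
    Variable("depletion", ("oxygen",), ("in coastal waters",)),  # made-up mention
]
hierarchy = build_hierarchy(mentions)
chain = [g.text() for g in generalisations(mentions[0])]

for step in chain:
    print(step)
```

In this sketch, two rare, long mentions end up sharing the more general entry "oxygen depletion", which is how pruning can raise the effective frequency of a variable and support hierarchical browsing. Which modifiers may safely be pruned, and in what order, is exactly the kind of decision the tree transformation operations would have to encode.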

Variable extraction
Variable generalisation
User interface
Findings
Discussion