Abstract

Keeping up with the rapidly growing literature has become virtually impossible for most scientists. This can have dire consequences. First, we may waste research time and resources on reinventing the wheel simply because we can no longer maintain a reliable grasp on the published literature. Second, and perhaps more detrimental, judicious (or serendipitous) combination of knowledge from different scientific disciplines, which would require following disparate and distinct research literatures, is rapidly becoming impossible for even the most ardent readers of research publications. Text mining -- the automated extraction of information from (electronically) published sources -- could potentially fulfil an important role -- but only if we know how to harness its strengths and overcome its weaknesses. As we do not expect that the rate at which scientific results are published will decrease, text mining tools are now becoming essential in order to cope with, and derive maximum benefit from, this information explosion. In genomics, this is particularly pressing as more and more rare disease-causing variants are found and need to be understood. Not being conversant with this technology may put scientists and biomedical regulators at a severe disadvantage. In this review, we introduce the basic concepts underlying modern text mining and its applications in genomics and systems biology. We hope that this review will serve three purposes: (i) to provide a timely and useful overview of the current status of this field, including a survey of present challenges; (ii) to enable researchers to decide how and when to apply text mining tools in their own research; and (iii) to highlight how the research communities in genomics and systems biology can help to make text mining from biomedical abstracts and texts more straightforward.

Highlights

  • The scientific literature provides an important source of knowledge generated by the research community; it does not become defunct five years after publication and it is not just something to promote the authors’ careers

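  • The growing number of published papers makes it harder for researchers to stay up to date with the relevant literature in their field. This affects their ability to generate meaningful and testable hypotheses, with some even suggesting that it is becoming a bottleneck in the scientific discovery process.[4]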

  • Only a relatively small number of papers are available for full-text mining, so most work is restricted to abstracts and titles, which are freely available from MEDLINE (only 30 per cent of curated protein-protein interactions (PPIs) can be found in the abstracts rather than the full text).[9]

Summary

Introduction

The scientific literature provides an important source of knowledge generated by the research community; it does not become defunct five years after publication and it is not just something to promote the authors’ careers. While large amounts of data relating to biological systems are stored in public repositories, an even larger amount can be found in a semi-structured form in the literature (see Figure 1). This knowledge is potentially very useful in a variety of genomics and systems biology contexts.[1] For example, manually curated and literature-derived protein-protein interaction datasets are typically used as gold standards by the systems biology community, and it is standard practice to extract parameters for mechanistic models from the literature. The increase in the number of papers being published means that it is becoming harder for researchers to stay up to date with the relevant literature in their field. This has an impact on their ability to generate meaningful and testable hypotheses, with some even suggesting that this is becoming a bottleneck in the scientific discovery process.[4]

