Automated recognition of malignancy mentions in biomedical literature

Yang Jin,Fernando C Pereira,Mark Y Liberman,Kevin Lerman,Ryan T Mcdonald,Peter S White,Raymond S Winters,Mark A Mandel,Steven Carroll

doi:10.1186/1471-2105-7-492

Abstract

Background: The rapid proliferation of biomedical text makes it increasingly difficult for researchers to identify, synthesize, and utilize developed knowledge in their fields of interest. Automated information extraction procedures can assist in the acquisition and management of this knowledge. Previous efforts in biomedical text mining have focused primarily upon named entity recognition of well-defined molecular objects such as genes, but less work has been performed to identify disease-related objects and concepts. Furthermore, promise has been tempered by an inability to efficiently scale approaches in ways that minimize manual efforts and still perform with high accuracy. Here, we have applied a machine-learning approach previously successful for identifying molecular entities to a disease concept to determine if the underlying probabilistic model effectively generalizes to unrelated concepts with minimal manual intervention for model retraining.

Results: We developed a named entity recognizer (MTag), an entity tagger for recognizing clinical descriptions of malignancy presented in text. The application uses the machine-learning technique Conditional Random Fields with additional domain-specific features. MTag was tested with 1,010 training and 432 evaluation documents pertaining to cancer genomics. Overall, our experiments resulted in 0.85 precision, 0.83 recall, and 0.84 F-measure on the evaluation set. Compared with a baseline system using string matching of text with a neoplasm term list, MTag performed with a much higher recall rate (92.1% vs. 42.1% recall) and demonstrated the ability to learn new patterns. Application of MTag to all MEDLINE abstracts yielded the identification of 580,002 unique and 9,153,340 overall mentions of malignancy. Significantly, addition of an extensive lexicon of malignancy mentions as a feature set for extraction had minimal impact in performance.

Conclusion: Together, these results suggest that the identification of disparate biomedical entity classes in free text may be achievable with high accuracy and only moderate additional effort for each new application domain.
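The evaluation figures reported in the abstract can be related to one another with a few lines of code. The sketch below is illustrative only and is not MTag or the paper's baseline system. It assumes the balanced F-measure (the harmonic mean of precision and recall) and exact-span scoring, and the term list, sentence, and gold mentions are hypothetical toy data. It pairs a naive term-list string matcher, in the spirit of the baseline described, with span-level precision and recall.

```python
from __future__ import annotations

import re


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def match_terms(text: str, terms: list[str]) -> set[tuple[int, int]]:
    """Return (start, end) character spans where a listed term occurs.

    Matching is case-insensitive and whole-word. Longer terms are tried
    first, and a later match that overlaps an accepted one is skipped.
    """
    accepted: list[tuple[int, int]] = []
    for term in sorted(terms, key=len, reverse=True):
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            start, end = m.start(), m.end()
            if not any(start < e and s < end for s, e in accepted):
                accepted.append((start, end))
    return set(accepted)


def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Exact-span precision and recall (an assumed scoring scheme)."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy example: none of this comes from the paper's corpus.
text = "Neuroblastoma and small cell lung carcinoma were compared."
terms = ["neuroblastoma", "carcinoma", "lung carcinoma"]


def span_of(phrase: str) -> tuple[int, int]:
    start = text.find(phrase)
    return (start, start + len(phrase))


gold = {span_of("Neuroblastoma"), span_of("small cell lung carcinoma")}
predicted = match_terms(text, terms)
p, r = precision_recall(predicted, gold)
print("toy baseline:", p, r, f_measure(p, r))

# The abstract's precision and recall round to the reported F-measure of 0.84.
print(round(f_measure(0.85, 0.83), 2))
```

In the toy sentence the term list finds "lung carcinoma" but not the full annotated mention "small cell lung carcinoma", so an exact-span scorer counts it as both a false positive and a miss. This is one way a fixed term list can lose recall on phrasings it does not contain.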

Highlights

The rapid proliferation of biomedical text makes it increasingly difficult for researchers to identify, synthesize, and utilize developed knowledge in their fields of interest
Automated information extraction methods, which have recently been increasingly concentrated upon biomedical text, can assist in the acquisition and management of this data
Although text mining applications have been successful in other domains and show promise for biomedical information extraction, issues of scalability impose significant impediments to broad use in biomedicine

Summary

Introduction

The rapid proliferation of biomedical text makes it increasingly difficult for researchers and clinicians to identify, query, synthesize, and utilize developed knowledge in their fields of interest. Previous efforts in biomedical text mining have focused primarily upon named entity recognition of well-defined molecular objects such as genes, but less work has been performed to identify disease-related objects and concepts. Automated information extraction methods, which have recently been increasingly concentrated upon biomedical text, can assist in the acquisition and management of this data. Particular challenges for text mining include the requirement for highly specified extractors in order to generate accuracies sufficient for users; considerable effort by highly trained computer scientists, with substantial input from biomedical domain experts, to develop extractors; and a significant body of manually annotated text, with comparable effort in generating annotated corpora, for training machine-learning extractors. The high number and wide diversity of biomedical entity types, along with the high complexity of biomedical literature, make auto-annotation of multiple biomedical entity classes a difficult and labor-intensive task.
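To make the annotation requirement concrete, the sketch below shows one common way that manually annotated mention spans can be turned into per-token training labels for a sequence model. It is an assumption for illustration, not the paper's documented procedure: the BIO (begin/inside/outside) scheme, the whitespace tokenizer, the `MALIG` label name, and the example sentence are all hypothetical.

```python
import re


def tokenize(text):
    """Split text into (token, start, end) triples on runs of non-whitespace."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def bio_labels(text, mention_spans, label="MALIG"):
    """Assign a BIO tag to every token.

    A token is tagged only if it lies fully inside an annotated span:
    B- if it starts the span, I- otherwise. All other tokens get O.
    """
    tokens = tokenize(text)
    labels = []
    for _, start, end in tokens:
        tag = "O"
        for m_start, m_end in mention_spans:
            if start >= m_start and end <= m_end:
                tag = ("B-" if start == m_start else "I-") + label
                break
        labels.append(tag)
    return [tok for tok, _, _ in tokens], labels


# Hypothetical annotated sentence: one mention given as a character span.
sentence = "Patients with acute myeloid leukemia were enrolled"
mention = "acute myeloid leukemia"
mention_start = sentence.find(mention)
spans = [(mention_start, mention_start + len(mention))]

tokens, labels = bio_labels(sentence, spans)
for token, tag in zip(tokens, labels):
    print(f"{token}\t{tag}")
```

Each annotated document yields one such label sequence, which is part of why a sizable hand-annotated corpus is needed before a machine-learning extractor can be trained.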

Journal: BMC Bioinformatics	Publication Date: Nov 7, 2006
Citations: 68	License type: cc-by
