Finding biomedical categories in Medline®

Lana Yeganova,Won Kim,Donald C Comeau,W John Wilbur

doi:10.1186/2041-1480-3-s3-s3

Abstract

Background: There are several humanly defined ontologies relevant to Medline. However, Medline is a fast-growing collection of biomedical documents, which creates difficulties in updating and expanding these humanly defined ontologies. Automatically identifying meaningful categories of entities in a large text corpus is useful for information extraction, construction of machine learning features, and development of semantic representations. In this paper we describe and compare two methods for automatically learning meaningful biomedical categories in Medline. The first approach is a simple statistical method that uses part-of-speech and frequency information to extract a list of frequent nouns from Medline. The second method implements an alignment-based technique to learn frequent generic patterns that indicate a hyponymy/hypernymy relationship between a pair of noun phrases. We then apply these patterns to Medline to collect frequent hypernyms as potential biomedical categories.

Results: We study and compare these two alternative sets of terms to identify semantic categories in Medline. We find that both approaches produce reasonable terms as potential categories. We also find that there is significant agreement between the two sets of terms. The overlap between the two methods improves our confidence regarding categories predicted by these independent methods.

Conclusions: This study is an initial attempt to extract categories that are discussed in Medline. Rather than imposing external ontologies on Medline, our methods allow categories to emerge from the text.

Highlights

There are several humanly defined ontologies relevant to Medline.
Finding meaningful categories of entities in such a large source of textual information is a useful task. These categories can be useful in constructing machine learning features, developing semantic representations for the text, finding smoothing or back-off probabilities for NLP tasks, and extracting information.
One example is SemCat [1], which contains over 5 million entities and is based on subsets of UMLS enriched with additional categories from GENIA [2], UniProt [3], the Gene Ontology (GO) [4], Entrez Gene [5], and other knowledge sources.

Summary

Introduction

There are several humanly defined ontologies relevant to Medline. Medline is a fast-growing collection of biomedical documents, which creates difficulties in updating and expanding these humanly defined ontologies. Identifying meaningful categories of entities in such a large text corpus is a useful task that supports information extraction, construction of machine learning features, and development of semantic representations. The paper describes and compares two methods for learning such categories automatically. The first is a simple statistical method that uses part-of-speech and frequency information to extract a list of frequent nouns from Medline. The second method implements an alignment-based technique to learn frequent generic patterns that indicate a hyponymy/hypernymy relationship between a pair of noun phrases. We apply these patterns to Medline to collect frequent hypernyms as potential biomedical categories. These categories can be useful in constructing machine learning features, developing semantic representations for the text, finding smoothing or back-off probabilities for NLP tasks, and extracting information. It is an attempt to define some important categories in the area of molecular biology.
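The two approaches described in the abstract, and the overlap between their outputs, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the toy sentences and their part-of-speech tags are invented, single words stand in for noun phrases, the `min_count` threshold is a hypothetical parameter, and two fixed Hearst-style patterns ("X such as Y", "Y and other X") are swapped in for the patterns that the paper learns by alignment.

```python
import re
from collections import Counter

# Toy POS-tagged sentences (Penn Treebank-style tags); invented for illustration,
# not taken from Medline.
TAGGED = [
    [("proteins", "NNS"), ("such", "JJ"), ("as", "IN"), ("insulin", "NN"),
     ("bind", "VBP"), ("receptors", "NNS")],
    [("drugs", "NNS"), ("such", "JJ"), ("as", "IN"), ("aspirin", "NN"),
     ("inhibit", "VBP"), ("enzymes", "NNS")],
    [("kinases", "NNS"), ("and", "CC"), ("other", "JJ"), ("proteins", "NNS"),
     ("modify", "VBP"), ("enzymes", "NNS")],
    [("proteins", "NNS"), ("such", "JJ"), ("as", "IN"), ("albumin", "NN"),
     ("carry", "VBP"), ("drugs", "NNS")],
]

# Plain-text form of the same sentences, used by the pattern-based method.
SENTENCES = [" ".join(tok for tok, _ in sent) for sent in TAGGED]


def frequent_nouns(tagged_sentences, min_count=2):
    """Stand-in for the first method: count noun-tagged tokens, keep frequent ones."""
    counts = Counter(
        tok.lower()
        for sent in tagged_sentences
        for tok, tag in sent
        if tag.startswith("NN")
    )
    return {word for word, n in counts.items() if n >= min_count}


# Fixed Hearst-style patterns (an assumption of this sketch); the named group
# "hyper" marks the candidate category and "hypo" the more specific term.
PATTERNS = [
    re.compile(r"(?P<hyper>\w+) such as (?P<hypo>\w+)"),
    re.compile(r"(?P<hypo>\w+) and other (?P<hyper>\w+)"),
]


def frequent_hypernyms(sentences, min_count=2):
    """Stand-in for the second method: count hypernyms matched by the patterns."""
    counts = Counter()
    for text in sentences:
        for pattern in PATTERNS:
            for match in pattern.finditer(text):
                counts[match.group("hyper").lower()] += 1
    return {word for word, n in counts.items() if n >= min_count}


nouns = frequent_nouns(TAGGED)
hypernyms = frequent_hypernyms(SENTENCES)
# Terms proposed by both methods; agreement is what raises confidence in a category.
shared = nouns & hypernyms
print(sorted(nouns), sorted(hypernyms), sorted(shared))
```

On this toy input, the frequency method keeps any noun that appears often, while the pattern method keeps only terms that appear in the hypernym position, so the intersection is the stricter set.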

Journal: Journal of Biomedical Semantics
Publication Date: Oct 1, 2012
Citations: 12
License type: CC BY 2.0

Similar Papers

Comparison of Two Methods for Finding Biomedical Categories in Medline
L Yeganova ... D C Comeau
01 Dec 2011

Unsupervised Concept Hierarchy Learning: A Topic Modeling Guided Approach
V.S Anoop ... P Deepak
Procedia computer science | VOL. 89
01 Jan 2015

Semantic Network Analysis Pipeline—Interactive Text Mining Framework for Exploration of Semantic Flows in Large Corpus of Text
Martin Cenek ... Eric Pak
Applied sciences | VOL. 9
05 Dec 2019

NoCS2: Topic-Based Clustering of Big Data Text Corpus in the Cloud
S.M Zobaed ... Razin Farhan Hussain
01 Dec 2018
