PubMed Phrases, an open set of coherent phrases for searching biomedical literature

Sun Kim,Lana Yeganova,W John Wilbur,Zhiyong Lu,Donald C Comeau

doi:10.1038/sdata.2018.104

Open Access

https://doi.org/10.1038/sdata.2018.104

Abstract

In biomedicine, key concepts are often expressed by multiple words (e.g., ‘zinc finger protein’). Previous work has shown that treating a sequence of words as a meaningful unit, where applicable, is not only important for human understanding but also beneficial for automatic information seeking. Here we present a collection of PubMed® Phrases that are beneficial for information retrieval and human comprehension. We define these phrases as coherent chunks that are logically connected. To collect the phrase set, we apply the hypergeometric test to detect segments of consecutive terms that are likely to appear together in PubMed. These text segments are then filtered using the BM25 ranking function to ensure that they are beneficial from an information retrieval perspective. This yields a set of 705,915 PubMed Phrases. We evaluate the quality of the set by investigating PubMed user click data and manually annotating a sample of 500 randomly selected noun phrases. We also analyze and discuss the usage of these PubMed Phrases in literature search.
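
As a rough illustration of the hypergeometric test mentioned in the abstract, the sketch below scores how surprising it is for two terms to co-occur in as many documents as they do. It is not the authors' implementation: the function name `hypergeom_tail`, the document-level counting scheme, and the example counts are all assumptions made here for illustration.

```python
from math import comb


def hypergeom_tail(k, population, successes, draws):
    """Return P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    Assumed (hypothetical) reading for phrase detection:
      population = total number of documents,
      successes  = documents containing the first term,
      draws      = documents containing the second term,
      k          = documents containing both terms.
    A small tail probability suggests the terms co-occur more often
    than chance alone would explain.
    """
    upper = min(successes, draws)
    total = comb(population, draws)
    # math.comb returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Made-up counts: 1000 documents, each term appears in 50 of them.
p_strong = hypergeom_tail(40, 1000, 50, 50)  # 40 co-occurrences
p_weak = hypergeom_tail(2, 1000, 50, 50)     # 2 co-occurrences
print(p_strong < p_weak)
```

With these invented counts, about 2.5 co-occurrences would be expected by chance, so 40 co-occurrences gives a far smaller tail probability than 2.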

Highlights

Unlike other general domains, the language of biomedicine uses its own terminology to describe scientific discoveries and applications.
We examined the composition of the set and found that 84.1% of the phrases are noun phrases.
We further noticed from the PubMed user logs that, given documents scored by BM25, users are four times more likely to click on a document containing query terms in the title than on a document that does not.

Summary

Background & Summary

The language of biomedicine uses its own terminology to describe scientific discoveries and applications. Collocations are restricted to noun/adjective phrases or phrasal verbs, whereas we do not limit phrases grammatically, but rather see them as more flexible entities to be used as building blocks to form longer phrases or sentences. Such an interpretation of phrases is better aligned with our goal of using the corpus to analyze queries, as queries may frequently contain incomplete phrases and, in general, are known to differ from traditional forms of written language[4]. To compute and compare retrieval performance in the absence of a manually annotated gold standard, we use a novel pseudo-relevance judgement technique, which is based on the assumption that documents containing query terms in their titles are more relevant to the query than documents that do not[13]. Guided by this evaluation, we collect a set of 705,915 multi-word strings that, in terms of retrieval performance, benefit from being interpreted as phrases rather than as individual tokens. Throughout this paper, the term phrase refers to a coherent chunk of words that are frequently used together.
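
The title-based pseudo-relevance idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' procedure: it assumes a document counts as pseudo-relevant only when every query term appears in its title, it uses plain whitespace tokenization, and the helper names `title_pseudo_relevant` and `average_precision` are hypothetical. Averaging `average_precision` over many queries would give a mean average precision score.

```python
def title_pseudo_relevant(query_terms, title):
    """Treat a document as pseudo-relevant if all query terms occur in its title.

    Assumption: case-insensitive matching on whitespace-separated tokens.
    """
    title_tokens = set(title.lower().split())
    return all(term.lower() in title_tokens for term in query_terms)


def average_precision(relevance_flags):
    """Average precision of one ranked list of True/False relevance flags."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance_flags, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / hits if hits else 0.0


# Made-up ranked titles for the query "zinc finger protein".
query = ["zinc", "finger", "protein"]
ranked_titles = [
    "Structure of a zinc finger protein domain",
    "Dietary zinc intake in adults",
    "A novel zinc finger protein in yeast",
]
flags = [title_pseudo_relevant(query, title) for title in ranked_titles]
ap = average_precision(flags)
```

In this toy ranking the first and third titles contain all three query terms, so the flags are relevant, not relevant, relevant.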

Methods

PubMed Phrases

Data Records

Mean average precision

Usage Notes

Topic terms from LDA

Findings

Additional information

Journal: Scientific Data
Publication Date: Jun 12, 2018
Citations: 14
License type: open-access
