HypertenGene: extracting key hypertension genes from biomedical literature with position and automatically-generated template features

Richard Tzong-Han Tsai, Yen-Ching Chang, Wen-Lian Hsu, Chi-Hsin Huang, Yue-Yang Bow, Wen-Harn Pan, Hong-Jie Dai, Po-Ting Lai

doi:10.1186/1471-2105-10-s15-s9


Open Access

https://doi.org/10.1186/1471-2105-10-s15-s9


Abstract

Background: The genetic factors leading to hypertension have been extensively studied, and large numbers of research papers have been published on the subject. One of hypertension researchers' primary research tasks is to locate key hypertension-related genes in abstracts. However, gathering such information with existing tools is not easy: (1) Searching for articles often returns far too many hits to browse through. (2) The search results do not highlight the hypertension-related genes discovered in the abstract. (3) Even though some text mining services mark up gene names in the abstract, the key genes investigated in a paper are still not distinguished from other genes. To facilitate the information gathering process for hypertension researchers, one solution would be to extract the key hypertension-related genes in each abstract. Three major tasks are involved in the construction of this system: (1) gene and hypertension named entity recognition, (2) section categorization, and (3) gene-hypertension relation extraction.

Results: We first compare the retrieval performance achieved by individually adding template features and position features to the baseline system. Then, the combination of both is examined. We found that using position features can almost double the original AUC score (0.8140 vs. 0.4936) of the baseline system. However, adding template features alone results in only a marginal improvement (0.0197). Including both improves AUC to 0.8184, indicating that these two sets of features are complementary and do not have overlapping effects. We then examine the performance in a different domain, diabetes, and the result shows a satisfactory AUC of 0.83.

Conclusion: Our approach successfully exploits template features to recognize true hypertension-related gene mentions and position features to distinguish key genes from other related genes. Templates are automatically generated and checked by biologists to minimize labor costs. Our approach integrates the advantages of machine learning models and pattern matching. To the best of our knowledge, this is the first systematic study of extracting hypertension-related genes and the first attempt to create a hypertension-gene relation corpus based on the GAD database. Furthermore, our paper proposes and tests novel features for extracting key hypertension genes, such as relative position, section, and template features, which could also be applied to key-gene extraction for other diseases.
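As a rough illustration of the three feature types named in the abstract (relative position, section, and template features), the following Python sketch computes them for a single gene mention. It is only a sketch under stated assumptions: the function names, the whitespace tokenization, the feature encoding, and the example template are all hypothetical and are not taken from the paper, whose templates are generated automatically and then checked by biologists.

```python
import re

# Hypothetical hand-written template, for illustration only. Entity mentions
# are assumed to have been replaced by the placeholders GENE and DISEASE.
TEMPLATE = re.compile(r"\bGENE\b.*\bassociated with\b.*\bDISEASE\b")


def relative_position(num_tokens, mention_index):
    """Scale a token index to [0, 1]; 0.0 is the first token of the text."""
    if num_tokens <= 1:
        return 0.0
    return mention_index / (num_tokens - 1)


def extract_features(sentence, gene, disease, section):
    """Build a toy feature dictionary for one gene mention.

    Assumes whitespace tokenization and that `gene` appears as its own token;
    a real system would rely on named entity recognition instead.
    """
    tokens = sentence.split()
    mention_index = tokens.index(gene)  # raises ValueError if gene is absent
    masked = sentence.replace(gene, "GENE").replace(disease, "DISEASE")
    return {
        "relative_position": relative_position(len(tokens), mention_index),
        "section": section,  # e.g. "title", "background", "conclusion"
        "template_match": bool(TEMPLATE.search(masked)),
    }


features = extract_features(
    "The ACE gene is associated with hypertension",
    gene="ACE",
    disease="hypertension",
    section="title",
)
```

In this toy setup, a mention that appears early in the text, in a prominent section, and inside a matching template would carry feature values that a downstream classifier could use to rank it as a key gene; how the paper actually encodes and weights these features is not specified here.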

Highlights

The genetic factors leading to hypertension have been extensively studied, and large numbers of research papers have been published on the subject
To the best of our knowledge, this is the first systematic study of extracting hypertension-related genes and the first attempt to create a hypertension-gene relation corpus based on the Genetic Association Database (GAD)
Our paper proposes and tests novel features for extracting key hypertension genes, such as relative position, section, and template features, which could be applied to key-gene extraction for other diseases

Summary

Introduction

The genetic factors leading to hypertension have been extensively studied, and large numbers of research papers have been published on the subject. Many hypertension researchers use PubMed to find and sort through papers of interest, and one of their primary research tasks is to locate key hypertension-related genes in abstracts. Gathering such information with existing tools is not easy, because searching for articles often returns far too many hits to browse through. Although some text mining services provide named entity recognition and mark up the gene names in an abstract, these systems do not distinguish the key genes that are the focus of research in the paper from other related genes that are merely mentioned.

Methods

Results

Discussion

Conclusion