Abstract

Background: Many disease-causing genes have been identified through different methods, but the disease phenotypes of these genes still lack uniform biomedical named entity (bio-NE) annotations. Furthermore, semantic similarity comparison between two bio-NE annotations has become important for data integration and systems genetics analysis.

Results: The package pyMeSHSim recognizes bio-NEs using MetaMap, which produces Unified Medical Language System (UMLS) concepts through natural language processing. To map the UMLS concepts to Medical Subject Headings (MeSH), pyMeSHSim embeds an in-house dataset containing the main headings (MHs), supplementary concept records (SCRs), and their relations in MeSH. Based on this dataset, pyMeSHSim implements four information content (IC)-based algorithms and one graph-based algorithm to measure the semantic similarity between two MeSH terms. To evaluate its performance, we used pyMeSHSim to parse OMIM and GWAS phenotypes. Introducing SCRs and a curation strategy for non-MeSH-synonymous UMLS concepts improved the performance of pyMeSHSim in recognizing OMIM phenotypes. In the curation of 461 GWAS phenotypes, pyMeSHSim showed recall > 0.94, precision > 0.56, and F1 > 0.70, outperforming the state-of-the-art tools DNorm and TaggerOne in recognizing MeSH terms from short biomedical phrases. The semantic similarity between the MeSH terms recognized by pyMeSHSim and those from previous manual curation was calculated with both pyMeSHSim and another semantic analysis tool, meshes. The similarity scores produced by the two tools were highly correlated, with correlations of 0.89–0.99.

Conclusions: The integrative MeSH tool pyMeSHSim, embedded with the MeSH MHs and SCRs, performs bio-NE recognition, normalization, and comparison for biomedical text mining.
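The three GWAS figures quoted above can be checked against each other, since F1 is the harmonic mean of precision and recall. The sketch below plugs in only the lower bounds stated in the abstract (precision 0.56, recall 0.94); the function name is illustrative and is not part of pyMeSHSim.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Lower bounds quoted in the abstract for the 461 GWAS phenotypes.
f1_at_bounds = f1_score(precision=0.56, recall=0.94)
print(round(f1_at_bounds, 3))
```

Because F1 increases with both precision and recall, any values above these bounds give an F1 above the value computed here, which agrees with the reported F1 > 0.70.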

Highlights

  • Many disease-causing genes have been identified through different methods, but the disease phenotypes of these genes still lack uniform biomedical named entity annotations

  • Evaluation with Online Mendelian Inheritance in Man (OMIM) phenotypes: To test whether the introduction of supplementary concept records (SCRs) and our curation strategy for non-MeSH-synonymous Unified Medical Language System (UMLS) concepts improves the performance of pyMeSHSim in biomedical named entity (bio-NE) recognition, we compared the genes annotated with Medical Subject Headings (MeSH) main headings (MHs) and SCRs from OMIM [18] phenotype-gene pairs

  • The OMIM phenotype-gene pairs were collected from the database disease-connect [30], which used MetaMap to process the disease phenotypes into MeSH-synonymous and non-MeSH-synonymous UMLS concepts

Summary

Introduction

Many disease-causing genes have been identified through different methods, but the disease phenotypes of these genes still lack uniform biomedical named entity (bio-NE) annotations. Semantic similarity comparison between two bio-NE annotations has become important for data integration and systems genetics analysis. Bio-NE recognition, normalization, and comparison are fundamental tasks for extracting and using valuable biomedical information from textual data. They are important for disease diagnosis [1], drug repositioning [2], overrepresentation analysis [3], and genetic analysis [4]. Because of its precise literature annotations, MeSH has become increasingly popular for normalizing bio-NEs, such as disease names, in public medical and genetic databases [8, 9]. Like Gene Ontology [10] and Disease Ontology, MeSH is structured as a directed acyclic graph [11], which allows the semantic similarity between two MeSH terms in the graph to be compared.
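To make the idea of IC-based similarity on a directed acyclic graph concrete, here is a minimal sketch. The text above does not name the four IC-based measures, so the two shown (Resnik and Lin, both standard IC-based measures) are chosen for illustration only. The term names, graph, and annotation counts are hypothetical toy data, not real MeSH records, and none of the functions belong to pyMeSHSim's API.

```python
import math

# Hypothetical toy DAG: each term maps to its parents (not real MeSH data).
PARENTS = {
    "disease": [],
    "neoplasm": ["disease"],
    "metabolic_disease": ["disease"],
    "carcinoma": ["neoplasm"],
    "sarcoma": ["neoplasm"],
    "endocrine_tumor": ["neoplasm", "metabolic_disease"],
}

# Hypothetical number of annotations attached directly to each term.
COUNTS = {
    "disease": 0,
    "neoplasm": 2,
    "metabolic_disease": 4,
    "carcinoma": 6,
    "sarcoma": 3,
    "endocrine_tumor": 1,
}


def ancestors(term):
    """Return the term itself plus all of its ancestors in the DAG."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """IC = -log p(term), where p counts the term and all its descendants."""
    total = sum(COUNTS.values())
    freq = sum(n for t, n in COUNTS.items() if term in ancestors(t))
    return -math.log(freq / total)


def resnik(a, b):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(a) & ancestors(b)
    return max(information_content(t) for t in common)


def lin(a, b):
    """Lin similarity: Resnik score normalized by the terms' own IC."""
    denom = information_content(a) + information_content(b)
    if denom == 0:
        return 0.0  # both terms carry no information (e.g. the root)
    return 2 * resnik(a, b) / denom


print(resnik("carcinoma", "sarcoma"), lin("carcinoma", "sarcoma"))
```

In this toy graph, two terms that share only the root score zero under Resnik because the root carries no information, whereas siblings under a more specific parent score higher. The multi-parent term shows why the ancestor search must walk a graph rather than a tree.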

