Abstract

Biomedical terminologies such as the National Cancer Institute Thesaurus (NCIt) are widely used to support biomedical research and applications, so their quality directly affects those downstream uses. In this paper, we introduce a hybrid method that identifies missing hierarchical IS-A relations in NCIt by leveraging both the role definitions and the lexical features of concepts in non-lattice subgraphs. We first extract the non-lattice subgraphs in NCIt, which are problematic areas that often indicate quality issues. We then model each concept using its role definitions, the words in its concept name, and the words in the names of its ancestors. Next, we perform two-step subsumption testing on candidate pairs of concepts in the non-lattice subgraphs to automatically suggest potentially missing IS-A relations. We applied our method to the 19.01d version of NCIt. We extracted a total of 9,512 non-lattice subgraphs, 654 of which revealed 268 potentially missing IS-A relations. After removing duplicate and redundant relations, 121 potentially missing IS-A relations remained. To evaluate our method, we adopted a retrospective ground truth (RGT) approach, which uses the differences between versions as the reference standard. We constructed the reference standard from the IS-A changes between the 19.01d and 19.07e versions of NCIt. Of the 121 potentially missing IS-A relations suggested by our method, 46 (38.0%) are valid according to the reference standard. The RGT-based evaluation indicates that our hybrid method is promising for detecting missing IS-A relations, which motivates a thorough evaluation by domain experts in future work.
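
To make the notion of a non-lattice subgraph more concrete, the following minimal Python sketch flags non-lattice pairs in a toy hierarchy. It assumes the usual lattice-based definition, which the abstract itself does not spell out: two concepts form a non-lattice pair when they share more than one maximal common descendant. The toy hierarchy and the names `PARENTS`, `descendants`, and `non_lattice_pairs` are hypothetical and are not taken from NCIt or from the authors' implementation. The sketch covers only the detection of candidate areas, not the two-step subsumption testing.

```python
from itertools import combinations

# Hypothetical toy hierarchy (not NCIt data): concept -> direct IS-A parents.
PARENTS = {
    "A": set(),
    "B": set(),
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"C", "D"},
}


def descendants(parents):
    """Map each concept to the set of all its strict descendants."""
    children = {concept: set() for concept in parents}
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            children[parent].add(child)

    result = {}

    def visit(node):
        # Memoized depth-first traversal; assumes the hierarchy is acyclic.
        if node not in result:
            found = set()
            for child in children[node]:
                found.add(child)
                found |= visit(child)
            result[node] = found
        return result[node]

    for concept in parents:
        visit(concept)
    return result


def non_lattice_pairs(parents):
    """Return {(a, b): maximal common descendants} for each non-lattice pair.

    Assumed definition: a pair is non-lattice when it has more than one
    maximal common descendant.
    """
    desc = descendants(parents)
    pairs = {}
    for a, b in combinations(sorted(parents), 2):
        if a in desc[b] or b in desc[a]:
            continue  # One concept is below the other, so the pair is fine.
        common = desc[a] & desc[b]
        # Keep common descendants that lie below no other common descendant.
        maximal = {c for c in common if not any(c in desc[o] for o in common)}
        if len(maximal) > 1:
            pairs[(a, b)] = maximal
    return pairs


if __name__ == "__main__":
    for pair, maximal in non_lattice_pairs(PARENTS).items():
        print(pair, sorted(maximal))
```

In this toy hierarchy, `A` and `B` share the common descendants `C`, `D`, and `E`, of which `C` and `D` are both maximal, so the pair is flagged. A pairwise scan like this is quadratic in the number of concepts and serves only as an illustration; a terminology the size of NCIt would likely call for a more scalable approach.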
