MeSH-Based Semantic Weighting Scheme to Enhance Document Indexing: Application on Biomedical Document Classification

Imen Gabsi, Dalila Souidi, Ikram Amous, Hager Kammoun

doi:10.1142/s0219649224500357

Abstract

The document indexing phase plays a significant role in text mining applications such as text document classification. The common indexing paradigm, known as the Bag-of-Words (BOW) representation, is based on term frequency in documents. However, this classical approach suffers from the ambiguity and disparity of words. In addition, traditional term weighting schemes, such as TF-IDF, exploit only the statistical information of terms in documents. To overcome these problems, we investigate semantic indexing of biomedical documents using concepts extracted from the MeSH knowledge resource. First, we focus on a disambiguation method that identifies the adequate senses of ambiguous MeSH concepts, and we consider four representation enrichment strategies to identify the most appropriate representatives of the adequate sense in the representation of textual entities. Second, we propose a semantic weighting scheme that quantifies the importance of MeSH concepts in documents through their occurrence frequency and their semantic similarities with unambiguous MeSH concepts. Our contribution lies particularly in the in-depth experimental study of the performance of these methods and, more precisely, of the impact of the semantic weighting scheme on that performance. To this end, three benchmark datasets were used: TREC 2004 Genomics, BioCreative II and OHSUMED.
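
To make the idea of a frequency-plus-similarity weight concrete, the following is a minimal illustrative sketch only. The abstract does not give the actual formula, so the combination used here (occurrence count scaled by one plus the mean similarity to the unambiguous concepts in the same document) is an assumption, as are the function names `semantic_weights` and `toy_similarity`, the hand-written similarity table, and the example concepts.

```python
from collections import Counter

def semantic_weights(doc_concepts, unambiguous, similarity):
    """Toy weight per concept: count * (1 + mean similarity to the
    unambiguous concepts found in the same document).
    ASSUMPTION: this formula is illustrative, not the paper's scheme."""
    counts = Counter(doc_concepts)
    # Unambiguous concepts that actually occur in this document.
    anchors = [c for c in counts if c in unambiguous]
    weights = {}
    for concept, freq in counts.items():
        # Compare each concept with the other anchors, never with itself.
        others = [a for a in anchors if a != concept]
        if others:
            mean_sim = sum(similarity(concept, a) for a in others) / len(others)
        else:
            mean_sim = 0.0
        weights[concept] = freq * (1.0 + mean_sim)
    return weights

# Hypothetical similarity values; a real system would derive them
# from the MeSH hierarchy or another semantic similarity measure.
SIM = {frozenset(("Neoplasms", "Carcinoma")): 0.8}

def toy_similarity(a, b):
    # Symmetric lookup; unknown pairs are treated as unrelated.
    return SIM.get(frozenset((a, b)), 0.0)

doc = ["Neoplasms", "Carcinoma", "Carcinoma", "Cold"]
weights = semantic_weights(
    doc,
    unambiguous={"Neoplasms", "Carcinoma"},
    similarity=toy_similarity,
)
```

In this toy setting, a concept that is both frequent and semantically close to the document's unambiguous concepts receives a higher weight than one that merely occurs, which is the intuition that a purely statistical scheme such as TF-IDF does not capture.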

Full Text