Traditional Chinese medicine (TCM) symptom normalization is difficult for three reasons: the same symptom can have different literal descriptions, descriptions and symptoms can stand in one-to-many relationships, and different symptoms can share a similar literal description. To address these problems, we propose a novel two-step approach that uses hierarchical semantic information representing the functional characteristics of symptoms, and we develop a text matching model that integrates this information through an attention mechanism. We constructed a symptom normalization dataset and a TCM normalization symptom dictionary containing normalization symptom words, and assigned the symptoms to 24 classes of functional characteristics. In the first step, we built a multi-label text classifier that extracts the hierarchical semantic information from each symptom description, counts the corresponding normalization symptoms, and filters the candidate set. In the second step, we designed a text matching model that mixes multi-granularity language features and uses an attention mechanism over the hierarchical semantic information to calculate the matching score between the symptom description and the normalization symptom words. We compared our approach with other baselines on real-world data. Our approach gave the best performance, with Hit@1, Hit@3, and Hit@10 of 0.821, 0.953, and 0.993, respectively, and a MeanRank of 1.596, significantly outperforming the baselines on the symptom normalization task. These results indicate the promise of this research direction.
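
As a rough illustration of the two-step flow and the reported metrics, the following Python sketch filters candidates by predicted class, ranks them by a matching score, and computes Hit@k and MeanRank. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the class labels ("lung", "sleep"), the toy dictionary, and the scores are all hypothetical, and the classifier and matching model are replaced by hard-coded values.

```python
from typing import Dict, List, Sequence, Set


def filter_candidates(predicted_classes: Set[str],
                      dictionary: Dict[str, Set[str]]) -> List[str]:
    """Step 1 (sketch): keep dictionary entries that share at least one
    functional-characteristic class with the symptom description."""
    return [term for term, classes in dictionary.items()
            if classes & predicted_classes]


def rank_of_gold(scores: Dict[str, float], gold: str) -> int:
    """Step 2 (sketch): sort candidates by matching score, highest first,
    and return the 1-indexed position of the correct entry."""
    ordered = sorted(scores, key=lambda term: scores[term], reverse=True)
    return ordered.index(gold) + 1


def hit_at_k(ranks: Sequence[int], k: int) -> float:
    """Fraction of descriptions whose correct entry is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


def mean_rank(ranks: Sequence[int]) -> float:
    """Average 1-indexed rank of the correct entry (lower is better)."""
    return sum(ranks) / len(ranks)


# Hypothetical toy dictionary: entry -> set of class labels (assumed names).
toy_dictionary = {
    "cough": {"lung"},
    "insomnia": {"sleep"},
    "dry cough": {"lung"},
}

# Assume the classifier predicted the class "lung" for one description.
candidates = filter_candidates({"lung"}, toy_dictionary)

# Assume the matching model produced these scores for the candidates.
toy_scores = {"cough": 0.4, "dry cough": 0.9}
example_rank = rank_of_gold(toy_scores, "cough")

# Ranks of the correct entry for four hypothetical descriptions.
toy_ranks = [1, 1, 2, 5]
print(hit_at_k(toy_ranks, 1), hit_at_k(toy_ranks, 3), mean_rank(toy_ranks))
```

Filtering before matching keeps the ranking step focused on a smaller candidate set, which is the motivation the abstract gives for using the class information first.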