An Improved Deep Learning Model: S-TextBLCNN for Traditional Chinese Medicine Formula Classification.

Ning Cheng,Jiajun Liu,Cheng Yan,Qunfu Huang,Wanqing Gao,Changsong Ding,Yue Chen,Xindi Huang

doi:10.3389/fgene.2021.807825

Open Access

https://doi.org/10.3389/fgene.2021.807825

Abstract

Purpose: This study proposes the S-TextBLCNN model for classifying traditional Chinese medicine (TCM) formulae by efficacy. The model uses deep learning to analyze the relationship between herb efficacy and formula efficacy, which helps in further exploring the internal rules of formula combination.

Methods: First, for the TCM herbs extracted from the Chinese Pharmacopoeia, natural language processing (NLP) is used to learn a quantitative representation of each herb. Three features (herb name, herb properties, and herb efficacy) are selected to encode herbs and to construct the herb-vector and the formula-vector. Then, based on 2,664 formulae for stroke collected from the TCM literature and 19 formula efficacy categories extracted from Yifang Jijie, an improved deep learning model, TextBLCNN, is proposed; it consists of a bidirectional long short-term memory (Bi-LSTM) neural network and a convolutional neural network (CNN). Binary classifiers are established for the 19 formula efficacy categories to classify the TCM formulae. Finally, to address the imbalance of the formula data, the over-sampling method SMOTE is applied, yielding the S-TextBLCNN model.

Results: The formula-vector composed of herb efficacy gives the best classification performance, which suggests a strong relationship between herb efficacy and formula efficacy. The TextBLCNN model has an accuracy of 0.858 and an F1-score of 0.762, both higher than the logistic regression (acc = 0.561, F1-score = 0.567), SVM (acc = 0.703, F1-score = 0.591), LSTM (acc = 0.723, F1-score = 0.621), and TextCNN (acc = 0.745, F1-score = 0.644) models. In addition, when the over-sampling method SMOTE is used to tackle data imbalance, the F1-score improves by an average of 47.1% across the 19 models.

Conclusion: The combination of formula feature representation and the S-TextBLCNN model improves the accuracy of formula efficacy classification. It provides a new research idea for the study of TCM formula compatibility.
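The SMOTE over-sampling step named in the Methods can be illustrated with a minimal from-scratch sketch. It is not the authors' implementation: the function names `smote` and `balance`, the neighbour count `k`, and the assumption that the positive class (label 1) is the minority in each binary task are all illustrative choices, and a real pipeline would more likely use an existing library.

```python
import random
from math import dist


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic samples by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    if len(minority) < 2:
        raise ValueError("SMOTE needs at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority neighbours, excluding the sample itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: dist(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


def balance(X, y, k=5, seed=0):
    """Oversample label 1 (assumed minority) until it matches label 0 in count."""
    pos = [x for x, label in zip(X, y) if label == 1]
    neg = [x for x, label in zip(X, y) if label == 0]
    n_new = max(0, len(neg) - len(pos))
    new = smote(pos, n_new, k=k, seed=seed) if n_new else []
    return X + new, y + [1] * len(new)
```

In a one-classifier-per-category setup, `balance` would be applied to the training split of each binary task separately, so that synthetic samples never leak into the evaluation data.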

Highlights

The Chinese herbal formula connects the basic theory of traditional Chinese medicine (TCM) with clinical practice, and it links syndrome differentiation to treatment in TCM
Our findings provide a new way to study the classification of TCM formulae by efficacy
We introduce some common deep learning (DL) models for TCM formulae classification, including two basic models: TextCNN and LSTM

Summary

Introduction

The Chinese herbal formula connects the basic theory of traditional Chinese medicine (TCM) with clinical practice, and it links syndrome differentiation to treatment in TCM. Systematically clarifying the modern scientific connotation of the implied relationships between formula compatibility and efficacy is a pressing problem in modern formula research and a major direction for the inheritance and innovation of TCM. Data imbalance often impairs model prediction (Indraswari et al., 2019; Yeh et al., 2020). These issues have caused many difficulties and challenges in TCM formulae research.
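A small worked example may help show why data imbalance matters for evaluation. The sketch below computes accuracy and F1-score for a binary task; the function name `accuracy_and_f1` and the toy labels are hypothetical and not taken from the study.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1-score) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = sum(1 for t, p in pairs if t == p) / len(pairs)
    # F1 = 2TP / (2TP + FP + FN); defined here as 0.0 when there are no positives at all.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return accuracy, f1


# Toy imbalanced task: 5 positives among 100 samples, classifier predicts all negative.
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
acc, f1 = accuracy_and_f1(y_true, y_pred)
```

Here the classifier never finds a positive sample, so its accuracy is high while its F1-score is zero, which is why both metrics are worth reporting on imbalanced data.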

Methods

Results

Conclusion