Traditional Chinese medicine text segmentation model with multi-metadata embedding based on Bidirectional LSTM
- Research Article
23
- 10.1002/asi.20237
- Sep 9, 2005
- Journal of the American Society for Information Science and Technology
The authors propose a heuristic method for Chinese automatic text segmentation based on a statistical approach. The method is developed from statistical information about the association among adjacent characters in Chinese text; mutual information of bi-grams and significance estimation of tri-grams are utilized. A heuristic method with six rules is then proposed to determine the segmentation points in a Chinese sentence. No dictionary is required. Chinese text segmentation is important in Chinese text indexing and thus greatly affects the performance of Chinese information retrieval. Because Chinese text lacks word delimiters, Chinese text segmentation is more difficult than English text segmentation. In addition, segmentation ambiguities and occurrences of out-of-vocabulary words (i.e., unknown words) are the major challenges in Chinese segmentation. Many research studies dealing with word segmentation have focused on resolving segmentation ambiguities; the problem of unknown-word identification has drawn less attention. The experimental results show that the proposed heuristic method shows promise in segmenting unknown words as well as known words. The authors further investigated the distribution of the errors of commission and the errors of omission caused by the heuristic method and benchmarked it against a previously proposed technique, boundary detection. The heuristic method was found to outperform the boundary detection method.
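The character-association statistic at the heart of this family of methods can be sketched as follows. This is an illustrative pointwise-mutual-information segmenter, not the authors' six-rule method: the `threshold` parameter and the toy corpus shape are assumptions for the sketch.

```python
from collections import Counter
from math import log2

def char_mutual_information(corpus):
    """Estimate pointwise mutual information for each adjacent character pair."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        unigrams.update(sentence)
        bigrams.update(sentence[i:i + 2] for i in range(len(sentence) - 1))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    mi = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        mi[pair] = log2(p_xy / (p_x * p_y))
    return mi

def segment(sentence, mi, threshold=1.0):
    """Insert a word boundary wherever adjacent characters associate weakly."""
    words, word = [], sentence[0]
    for i in range(len(sentence) - 1):
        if mi.get(sentence[i:i + 2], float("-inf")) >= threshold:
            word += sentence[i + 1]      # strong association: stay in the word
        else:
            words.append(word)           # weak association: cut here
            word = sentence[i + 1]
    words.append(word)
    return words
```

In practice the statistics would be estimated from a large corpus, and the paper's heuristic rules (rather than a single fixed threshold) decide each cut point.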
- Book Chapter
8
- 10.1007/978-3-540-24594-0_52
- Jan 1, 2003
Chinese text segmentation is important in Chinese text indexing. Because Chinese text lacks word delimiters, Chinese text segmentation is more difficult than English text segmentation. In addition, segmentation ambiguities and occurrences of out-of-vocabulary words (i.e., unknown words) are the major challenges in Chinese segmentation. Many research works dealing with word segmentation have focused on resolving segmentation ambiguities; the problem of unknown-word identification has drawn less attention. In this paper, we propose a heuristic method for Chinese text segmentation based on the statistical approach. The experimental results show that our heuristic method shows promise in segmenting unknown words as well as known words. We have further investigated the distribution of the errors of commission and the errors of omission caused by the proposed heuristic method and benchmarked it against our previously proposed technique, boundary detection.
- Conference Article
1
- 10.1109/tocs53301.2021.9688602
- Dec 10, 2021
Word segmentation is a basic task of natural language processing whose purpose is to segment text correctly according to context. Because of the vagueness, classical-language style, fixed word order, and unstructured nature of Traditional Chinese Medicine (TCM) texts, their word segmentation has not been effectively solved. This paper uses 20,000 TCM texts, collected from the Chinese medicine clinic of the Second Affiliated Hospital of Shandong University of Traditional Chinese Medicine between 2005 and 2020, as the dataset (source URL: http://www.sdmlzy.com). Characters in the TCM texts are labeled with four word-position tags, word2vec is applied to embed them, a bidirectional Gated Recurrent Unit (GRU), a variant of the Long Short-Term Memory (LSTM) network, models the sequences, and the Viterbi algorithm then decodes the TCM word segmentation. Experimental results show that the proposed model simplifies the gate structure while retaining automatic feature learning and the use of contextual information. Applied to the TCM text corpus, the model reaches a precision of 93.26%.
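The four-tag decoding step this abstract describes can be sketched as follows: an illustrative Viterbi decoder over the standard BMES character tags (Begin/Middle/End/Single), where the per-character emission scores stand in for the output of the paper's bidirectional GRU. The scores, tag set, and function names here are assumptions, not the paper's code.

```python
import numpy as np

TAGS = ["B", "M", "E", "S"]  # begin / middle / end of word, single-char word
# Transitions legal under the BMES scheme, e.g. B can only be followed by M or E.
ALLOWED = {"B": {"M", "E"}, "M": {"M", "E"}, "E": {"B", "S"}, "S": {"B", "S"}}

def viterbi(emissions):
    """Best BMES tag path for per-character scores (n_chars x 4),
    e.g. the output layer of a bidirectional GRU."""
    n = len(emissions)
    score = np.full((n, 4), -np.inf)
    back = np.zeros((n, 4), dtype=int)
    for j, t in enumerate(TAGS):         # first character must start a word
        if t in ("B", "S"):
            score[0, j] = emissions[0][j]
    for i in range(1, n):
        for j, t in enumerate(TAGS):
            for k, prev in enumerate(TAGS):
                cand = score[i - 1, k] + emissions[i][j]
                if t in ALLOWED[prev] and cand > score[i, j]:
                    score[i, j], back[i, j] = cand, k
    for j, t in enumerate(TAGS):         # last character must end a word
        if t not in ("E", "S"):
            score[n - 1, j] = -np.inf
    path = [int(np.argmax(score[n - 1]))]
    for i in range(n - 1, 0, -1):
        path.append(back[i, path[-1]])
    return [TAGS[j] for j in reversed(path)]

def tags_to_words(chars, tags):
    """Cut the character sequence at every E or S tag."""
    words, word = [], ""
    for c, t in zip(chars, tags):
        word += c
        if t in ("E", "S"):
            words.append(word)
            word = ""
    return words
```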
- Research Article
125
- 10.1002/(sici)1097-4571(199310)44:9<532::aid-asi3>3.0.co;2-m
- Oct 1, 1993
- Journal of the American Society for Information Science
Present text retrieval systems are generally built on the reductionist basis that words in texts (keywords) are used as indexing terms to represent the texts. A necessary precursor to these systems is word extraction which, for English texts, can be achieved automatically by using spaces and punctuation as word delimiters. This cannot be readily applied to Chinese texts because they do not have obvious word boundaries. A Chinese text consists of a linear sequence of nonspaced or equally spaced ideographic characters, which are similar to morphemes in English. Researchers of Chinese text retrieval have been seeking methods of text segmentation to divide Chinese texts automatically into words. First, a review of these methods is provided in which the various approaches to Chinese text segmentation are classified in order to give a general picture of the research activity in this area. Some of the most important work is described. There follows a discussion, with illustrative examples, of the problems of Chinese text segmentation. These problems include morphological complexities, segmentation ambiguity, and parsing problems, and demonstrate that text segmentation remains one of the most challenging and interesting areas for Chinese text retrieval. © 1993 John Wiley & Sons, Inc.
- Conference Article
6
- 10.1145/996350.996410
- Jun 7, 2004
Chinese text segmentation is important for the indexing of Chinese documents and has a significant impact on the performance of Chinese information retrieval. The statistical approach overcomes the limitations of the dictionary-based approach: it utilizes statistical information about the association of adjacent characters collected from a Chinese corpus, and it can segment both known and unknown words. However, errors may occur due to the limitations of the corpus. In this work, we have conducted an error analysis of two statistical Chinese text segmentation techniques, namely boundary detection and the heuristic method. Such error analysis is useful for the future development of automatic segmentation of Chinese text or text in other oriental languages. It is also helpful for understanding the impact of these errors on information retrieval systems in digital libraries.
- Research Article
- 10.1016/j.cageo.2023.105512
- Dec 23, 2023
- Computers & Geosciences
A hybrid method of combination probability and machine learning for Chinese geological text segmentation
- Research Article
4
- 10.1155/2021/2337924
- Nov 29, 2021
- Evidence-Based Complementary and Alternative Medicine
Text similarity calculation plays a crucial role as the core of artificial intelligence commercial applications such as traditional Chinese medicine (TCM) auxiliary diagnosis, intelligent question answering, and prescription recommendation. However, TCM texts suffer from short sentence expressions, inaccurate word segmentation, strong semantic relevance, and high, sparse feature dimensions. This study comprehensively considers the temporal information of sentence context and proposes a TCM text similarity calculation model based on a bidirectional temporal Siamese network (BTSN). We used the Enhanced Representation through Knowledge Integration (ERNIE) pretrained language model to train character vectors instead of word vectors, which sidesteps the problem of inaccurate word segmentation in TCM. In the Siamese network, the traditional fully connected neural network was replaced by a deep bidirectional long short-term memory (BLSTM) to capture the contextual semantics of the current word. The improved-similarity BLSTM maps the two sentences under test into two low-dimensional numerical vectors, on which similarity-calculation training is then performed. Experiments on financial and TCM datasets show that the BTSN model outperformed other similarity calculation models, with the highest accuracy reached at six BLSTM layers. This verifies that the proposed text similarity calculation model has high engineering value.
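The final similarity step of a Siamese architecture like this one can be sketched as follows: a cosine similarity between the two low-dimensional sentence vectors produced by the twin branches. The vectors here are placeholders; the paper's actual similarity layer and its training objective are not specified in the abstract.

```python
import numpy as np

def siamese_similarity(vec_a, vec_b):
    """Cosine similarity between the two sentence vectors emitted by
    the twin BLSTM branches of a Siamese network."""
    vec_a = np.asarray(vec_a, dtype=float)
    vec_b = np.asarray(vec_b, dtype=float)
    return float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))
```

Identical vectors score 1.0, orthogonal vectors 0.0; a trained model thresholds or regresses on this score to decide whether two TCM sentences match.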
- Conference Article
1
- 10.1109/icisfall51598.2021.9627361
- Oct 13, 2021
With the development of the new media industry, comment-based user interaction is now fairly routine in live broadcasting. User comments usually take the form of short text with freestyle expressions and cyber new words, and general word segmentation methods cannot adapt to Chinese short text in new media comments. This paper proposes a novel Chinese short text segmentation method to solve the problem of self-adapting word segmentation granularity. A New Media Comment Short Text Dataset (NMCD) is built for this research, along with a word-vector corpus containing cyber new words and entity words. Our optimized bidirectional Long Short-Term Memory (LSTM) model, based on an attention mechanism and transfer learning, keeps a number and its unit together after segmentation. The experimental results show that the F1-score is improved by 21.43%. The proposed word segmentation method can be efficiently applied to new media comment analysis systems.
- Research Article
14
- 10.1016/j.eswa.2009.10.004
- Oct 27, 2009
- Expert Systems With Applications
Chinese text segmentation: A hybrid approach using transductive learning and statistical association measures
- Conference Article
- 10.1109/icdacai57211.2022.00099
- Aug 1, 2022
Because news on the Internet is numerous and disordered, and therefore difficult to classify and manage accurately, a text classification method based on BiLSTM (bidirectional long short-term memory) and an attention mechanism is applied in this paper. First, each word segment of the Chinese news content is embedded into a word vector through word2vec. Then, after feature preprocessing by the BiLSTM layer, which can learn two-way long-term dependence, the attention weights are updated by the attention mechanism. Finally, after ReLU and fully connected layers, the classifier assigns news tags. In the experiment, the THUCNews dataset is used to verify the effectiveness of the method. On a test set of 10,000 samples, the accuracy reaches 97.46%, the recall 97.47%, and the F1 score 97.45%; all three metrics are higher than those of the traditional CNN, BiLSTM, and BiLSTM+pooling classification models. The experimental results show that the BiLSTM+attention fusion model can positively affect Chinese long news text classification.
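The attention step described above, i.e. weighting the BiLSTM's per-timestep outputs before classification, can be sketched as follows. This is a generic additive attention pooling; the parameter names `w`, `b`, `u` are illustrative, not taken from the paper.

```python
import numpy as np

def attention_pool(hidden_states, w, b, u):
    """Attention pooling over BiLSTM outputs: score each timestep,
    softmax the scores, and return the weighted sum as a sentence vector.
    hidden_states: (T, d) matrix of BiLSTM outputs; w (d, d), b (d,), u (d,)
    are the attention parameters."""
    scores = np.tanh(hidden_states @ w + b) @ u   # one scalar score per timestep
    weights = np.exp(scores - scores.max())       # numerically stable softmax
    weights /= weights.sum()
    return weights @ hidden_states                # (d,) weighted sentence vector
```

The resulting vector would then pass through the ReLU and fully connected layers before the final classifier.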
- Conference Article
3
- 10.1109/wkdd.2010.61
- Jan 1, 2010
- 2010 3rd International Conference on Knowledge Discovery and Data Mining
Because automatic word segmentation of Chinese text loses information, a word segmentation method that uses the lexical chunk as the segmentation unit is proposed. Chinese text is first segmented with a traditional method; the mutual information between two lexical entries and the adjacency frequency of two or more lexical entries are then calculated, and lexical chunks are judged and marked from these values using the related words. Experiments show that after word combination, the lexical chunks carry much more feature information, which yields a better processing result. This also demonstrates the effect of feature selection in Chinese text categorization and enhances text classification capability.
- Conference Article
- 10.1109/iccsnt.2012.6525957
- Dec 1, 2012
In recognizing Chinese handwritten text, character segmentation is the key step, so studying how to segment Chinese characters effectively plays an important role in improving the overall performance of a Chinese character recognition system. This paper studies and improves the algorithms used to segment Chinese handwritten text. First, after image binarization and smoothing, a multi-step extraction algorithm for finding nonlinear rows is presented, segmenting the text image into character rows. Single Chinese characters are then segmented from the character rows using a modified Viterbi algorithm, stroke analysis, and other techniques. The paper mainly addresses row overlapping, non-touching character segmentation, touching character segmentation, and over-segmentation. The experimental results show that the presented segmentation algorithm has high anti-interference capability, good stability, and a high accuracy rate.
- Conference Article
1
- 10.1109/icnisc57059.2022.00140
- Sep 1, 2022
Traditional text classification models mostly use Word2vec and GloVe to represent word vectors. When such models classify Chinese short text data, they cannot represent contextual semantic relationships well or completely extract text features. In this paper, the ERNIE (Enhanced Representation through Knowledge Integration) model is applied to a hybrid neural network model, which enhances the semantic representation of characters and generates character vectors by associating contextual semantic relations. A CNN (Convolutional Neural Network) and a BiLSTM (Bidirectional Long Short-Term Memory) then extract the characteristic information of the text data through the CNN's convolution kernels of different sizes and the BiLSTM's bidirectional network structure. Moreover, during training, the weight-decay mechanism of the AdamW algorithm replaces the traditional Adam algorithm to optimize model performance. Finally, the classification results are output by a softmax classifier. Comparative experiments on the THUCNews and TouTiaoNews datasets show that the Precision, Recall, and F1-score of this model are effectively improved over the traditional neural network models and a BERT-based model.
- Conference Article
14
- 10.3115/1073012.1073025
- Jan 1, 2001
This paper describes a system for segmenting Chinese text into words using the MBDP-1 algorithm. MBDP-1 is a knowledge-free segmentation algorithm that bootstraps its own lexicon, which starts out empty. Experiments on Chinese and English corpora show that MBDP-1 reliably outperforms the best previous algorithm when the available hand-segmented training corpus is small. As the size of the hand-segmented training corpus grows, the performance of MBDP-1 converges toward that of the best previous algorithm. The fact that MBDP-1 can be used with a small corpus is expected to be useful not only for the rare event of adapting to a new language, but also for the common event of adapting to a new genre within the same language.
- Research Article
53
- 10.1002/(sici)1097-4571(199503)46:2<83::aid-asi2>3.0.co;2-0
- Mar 1, 1995
- Journal of the American Society for Information Science
Text segmentation is a prerequisite for text retrieval systems. Chinese texts cannot be readily segmented into words because they do not contain word boundaries. ACTS is an automatic Chinese text segmentation prototype for Chinese full-text retrieval. It applies partial syntactic analysis, that is, the analysis of morphemes, words, and phrases. The idea was originally largely inspired by experiments on English morpheme- and phrase-analysis-based text retrieval, which are particularly germane to Chinese because neither Chinese nor English texts have morpheme and phrase boundaries. ACTS is built on the hypothesis that Chinese words and phrases exceeding two characters can be characterized by a grammar that describes the concatenation behavior of the morphological and syntactic categories of their formatives. This is examined through three procedures: (1) Segmentation: texts are divided into one- and two-character segments by matching against a dictionary; (2) Category disambiguation: the syntactic categories of segments are determined according to context; (3) Parsing: the segments are analyzed based on the grammar and subsequently combined into compound and complex words for indexing and retrieval. The experimental results, based on a small sample of 30 texts, show that most significant words and phrases in these texts can be extracted with a high degree of accuracy. © 1995 John Wiley & Sons, Inc.