Chinese word segmentation (CWS), which involves splitting a sequence of Chinese characters into words, is a key task in natural language processing (NLP) for Chinese. However, the complexity and flexibility of geological terms require that domain-specific knowledge be utilized in CWS for the geoscience domain. Previous studies have identified several challenges for CWS in this domain, including the scarcity of labeled data and complex geological word boundaries that are difficult to delineate. To address these problems, a novel semi-supervised deep learning framework, GeoCWS, is developed for CWS in the geoscience domain. The framework combines domain-enhanced features with an uncertainty-aware self-training strategy. First, n-grams are automatically constructed from the input text as a pseudo-lexicon. Then, a BERT-based backbone model is proposed that learns domain-enhanced features through a pseudo-lexicon-based memory mechanism in order to delineate complex geological word boundaries. Next, the backbone model is fine-tuned with a small amount of labeled data to obtain the teacher model. Finally, we design a self-training strategy with joint confidence and uncertainty awareness to improve the generalization ability of the backbone model to unlabeled data. Our method outperformed the state-of-the-art baseline methods in extensive experiments, and ablation experiments verified the effectiveness of the proposed backbone model and self-training strategy.
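Two of the steps described above, building a pseudo-lexicon from n-grams and filtering pseudo-labels by confidence and uncertainty, can be illustrated with a minimal Python sketch. It is not the GeoCWS implementation: the function names, the frequency threshold `min_count`, the n-gram length limit `max_n`, the use of score variance across repeated stochastic forward passes as the uncertainty measure, and both cutoff values are assumptions made only for illustration.

```python
from collections import Counter
from statistics import mean, pvariance


def build_pseudo_lexicon(sentences, max_n=4, min_count=2):
    """Collect character n-grams (length 2..max_n) seen at least min_count times.

    The frequency filter is an assumption; the abstract only states that
    n-grams are constructed automatically from the input text.
    """
    counts = Counter()
    for sentence in sentences:
        for n in range(2, max_n + 1):
            for i in range(len(sentence) - n + 1):
                counts[sentence[i:i + n]] += 1
    return {gram for gram, count in counts.items() if count >= min_count}


def select_pseudo_labels(samples, min_confidence=0.9, max_variance=0.01):
    """Keep unlabeled samples whose teacher scores are both high and stable.

    Each sample is (text, scores), where scores are hypothetical teacher
    probabilities for the predicted segmentation over several stochastic
    forward passes. Mean score stands in for confidence and population
    variance for uncertainty; both choices are assumptions.
    """
    selected = []
    for text, scores in samples:
        confident = mean(scores) >= min_confidence
        stable = pvariance(scores) <= max_variance
        if confident and stable:
            selected.append(text)
    return selected


if __name__ == "__main__":
    corpus = ["花岗岩侵入体", "花岗岩地层"]
    print(sorted(build_pseudo_lexicon(corpus, max_n=3)))

    unlabeled = [
        ("a", [0.95, 0.96, 0.94]),  # confident and stable
        ("b", [1.0, 1.0, 0.75]),    # confident on average but unstable
        ("c", [0.5, 0.5, 0.5]),     # stable but not confident
    ]
    print(select_pseudo_labels(unlabeled))
```

In a self-training loop of this kind, the selected samples and their teacher-predicted segmentations would be added to the training set for the next round, so requiring both conditions is meant to limit the noise that confidently wrong pseudo-labels would otherwise introduce.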