GeoBERTSegmenter: Word Segmentation of Chinese Texts in the Geoscience Domain Using the Improved BERT Model

Dongqi Wei, Zhong Xie, Dexin Xu, Kai Ma, Zhihao Liu, Liufeng Tao, Qinjun Qiu, Shengyong Pan

doi:10.1029/2022ea002511

Abstract

Unlike English, Chinese has no natural separator, such as a space, between words, which makes Chinese word segmentation (CWS) a difficult information processing problem. Geological texts contain a large number of unregistered (out-of-vocabulary) geological terms, and existing rule-based, machine-learning, and deep-learning methods still cannot reliably segment geoscience text, especially given the large number of unregistered words. In this study, we propose GeoBERTSegmenter, a CWS model based on GeoBERT (Geoscience Bidirectional Encoder Representations from Transformers) that is specifically designed with various linguistic irregularities in mind. The method extends a general model to a combination of BERT, a bidirectional long short-term memory network (BiLSTM), and a conditional random field (GeoBERT + BiLSTM + CRF), with a number of features designed to address the CWS task in geological text. We also pretrain a domain-specific language model, GeoBERT, on a large amount of Chinese geological text. In open testing, the model obtains a precision of 94.77%, a recall of 96.31%, and an F1 of 95.44%, indicating that the proposed strategy performs much better than the alternative methods in our study. The model effectively identifies unregistered geological terms while preserving the recognition rate of common words, which lays a foundation for natural language processing of Chinese text in the geoscience domain.
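For readers unfamiliar with how CWS output is scored, the following is a minimal sketch of word-level precision, recall, and F1, assuming the common convention that a predicted word counts as correct only when both of its character boundaries match the gold segmentation. The function names and the example sentence are illustrative assumptions, not taken from the paper, and the paper's own evaluation script may differ.

```python
from typing import List, Set, Tuple


def words_to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a segmented sentence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F1 for one sentence.

    Assumption: a predicted word is correct only if its exact span
    appears in the gold segmentation.
    """
    if "".join(gold) != "".join(pred):
        raise ValueError("gold and pred must segment the same character sequence")
    gold_spans = words_to_spans(gold)
    pred_spans = words_to_spans(pred)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the segmenter wrongly splits the term "花岗岩" (granite).
gold = ["花岗岩", "侵入体"]
pred = ["花岗", "岩", "侵入体"]
precision, recall, f1 = segmentation_prf(gold, pred)
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f}")
```

In this invented example only one of the three predicted words matches a gold span, which illustrates why over-splitting an unregistered domain term lowers both precision and recall.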

Full Text