Abstract

Unlike English, Chinese has no natural separator between words, which makes Chinese word segmentation (CWS) a difficult information processing problem. Geological texts contain a large number of unregistered (out-of-vocabulary) geological terms, and existing rule-based, machine-learning and deep-learning methods still cannot adequately segment geoscience text, especially where such unregistered words are involved. In this study, we propose GeoBERTSegmenter, a CWS model based on GeoBERT (Geoscience-Bidirectional Encoder Representation from Transformers) that is specifically designed with various linguistic irregularities in mind. The method extends a general model to a combined GeoBERT, bidirectional long short-term memory (BiLSTM) and conditional random field (CRF) model (GeoBERT + BiLSTM + CRF), with a number of features designed to address the CWS task in geological text. We also train the pretrained language model, GeoBERT, on a large corpus of Chinese geological text. In open testing, the model achieves a precision of 94.77%, a recall of 96.31% and an F1 of 95.44%, indicating that the proposed strategy performs much better than the alternative methods considered in our study. Unregistered geological terms are identified effectively while the recognition rate of common words is maintained, which lays a foundation for natural language processing in the geoscience domain through Chinese text word segmentation.
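For readers unfamiliar with how segmentation quality is usually scored, the following is a minimal sketch of word-level precision, recall and F1, assuming the common convention that a predicted word counts as correct only when both of its boundaries match the gold segmentation. The function names `to_spans` and `segmentation_prf` and the example sentence are hypothetical and are not taken from the study; the abstract does not state the exact scoring procedure used.

```python
def to_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def segmentation_prf(gold_words, pred_words):
    """Word-level precision, recall and F1 for one segmented sentence.

    Assumption: a predicted word is correct only if its exact character
    span also appears in the gold segmentation.
    """
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical example: the geological term "花岗岩" (granite) is wrongly split.
gold = ["花岗岩", "侵入体", "发育"]
pred = ["花岗", "岩", "侵入体", "发育"]
precision, recall, f1 = segmentation_prf(gold, pred)
print(f"P={precision:.4f} R={recall:.4f} F1={f1:.4f}")
```

In this illustration, splitting a single domain term lowers both precision and recall, which is the kind of error that unregistered geological terms tend to cause.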

