BERTCWS: unsupervised multi-granular Chinese word segmentation based on a BERT method for the geoscience domain

Qinjun Qiu, Zhong Xie, Kai Ma, Miao Tian

doi:10.1080/19475683.2023.2186487

Abstract

Unlike alphabet-based languages such as English, Chinese has no explicit word boundaries. Word segmentation is therefore a fundamental step in Chinese text processing, information retrieval, and knowledge discovery. In the geoscience domain, most existing Chinese word segmentation tools and models require a prespecified dictionary and a large domain-relevant training corpus, and their segmentation accuracy drops significantly when the same methods are applied to out-of-domain text. To address this issue, a purely unsupervised and generic two-stage architecture (named BERTCWS) for domain-specific Chinese word segmentation is proposed. We first design an incidence matrix termed the ‘character combination tightness’ to calculate the closeness between characters. Then, BERTCWS recognizes geoscience terms using a segmenter based on Bidirectional Encoder Representations from Transformers (BERT), and multi-granular segmentation is generated by setting different thresholds. Finally, a discriminator is constructed to validate the correctness of the segmented words. Our numerical study demonstrates that BERTCWS can identify both general-domain terms and geoscience-domain terms. Additionally, multi-granular segmentation can be applied to offer a set of potential geoscience terms of various lengths.
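The threshold idea can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not state how the tightness values are computed, so the scores below are made-up numbers, and the function names `segment_by_threshold` and `multi_granular` are hypothetical. The sketch only assumes that each pair of adjacent characters has one tightness score and that a word boundary is placed wherever that score falls below the chosen threshold.

```python
from typing import Dict, List, Sequence


def segment_by_threshold(
    chars: str, tightness: Sequence[float], threshold: float
) -> List[str]:
    """Split `chars` wherever adjacent-character tightness is below `threshold`.

    tightness[i] is an assumed score for the bond between chars[i] and
    chars[i + 1]; how BERTCWS derives it is not specified in the abstract.
    """
    if not chars:
        return []
    if len(tightness) != len(chars) - 1:
        raise ValueError("need exactly one tightness score per adjacent pair")

    words: List[str] = []
    current = chars[0]
    for ch, score in zip(chars[1:], tightness):
        if score >= threshold:
            current += ch          # tight enough: stay in the same word
        else:
            words.append(current)  # loose bond: close the word here
            current = ch
    words.append(current)
    return words


def multi_granular(
    chars: str, tightness: Sequence[float], thresholds: Sequence[float]
) -> Dict[float, List[str]]:
    """Return one segmentation per threshold (higher threshold, finer words)."""
    return {t: segment_by_threshold(chars, tightness, t) for t in thresholds}


# Illustrative input: "granite stratum", with invented tightness scores.
text = "花岗岩地层"
scores = [0.9, 0.8, 0.3, 0.7]
results = multi_granular(text, scores, thresholds=[0.2, 0.5, 0.85])
```

With these invented scores, a low threshold keeps the whole phrase as one candidate term, a middle threshold separates it into two words, and a high threshold yields shorter fragments, which is the sense in which different thresholds give segmentations of different granularity.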

Full Text