A Levenshtein distance-based method for word segmentation in corpus augmentation of geoscience texts

Jinqu Zhang, Lang Qian, Shu Wang, Yunqiang Zhu, Zhenji Gao, Hailong Yu, Weirong Li

doi:10.1080/19475683.2023.2165543

Abstract

For geoscience text, rich domain corpora are the basis for improving model performance in word segmentation. However, the lack of annotated domain-specific corpora has become a major obstacle to professional information mining in geoscience fields. In this paper, we propose a corpus augmentation method based on Levenshtein distance. A geoscience dictionary of 20,137 words was constructed by crawling keywords from papers published in the China National Knowledge Infrastructure (CNKI). The dictionary was then used as the main source of synonyms to enrich the geoscience corpus according to the Levenshtein distance between words. Finally, a Chinese word segmentation model combining BERT, a bidirectional gated recurrent unit (Bi-GRU), and conditional random fields (CRF) was implemented. A geoscience corpus composed of long, complex, domain-specific vocabulary was selected to test the proposed word segmentation framework. CNN-LSTM, Bi-LSTM-CRF, and Bi-GRU-CRF models were also used to evaluate the effect of the Levenshtein data augmentation technique. Experimental results show that the proposed method achieves a significant performance improvement of more than 10%. It has great potential for natural language processing tasks such as named entity recognition and relation extraction.
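
The dictionary-based augmentation step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not state the distance threshold or the replacement rule, so the `max_distance` parameter, the helper names `nearest_terms` and `augment`, and the four-term toy dictionary are all assumptions made for illustration. The sketch computes the Levenshtein distance between words and, for each segmented sentence, swaps one dictionary word at a time for a nearby dictionary term to produce additional pre-segmented training sentences.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # the distance is symmetric; keep the DP row short
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def nearest_terms(word: str, dictionary: List[str],
                  max_distance: int = 1) -> List[str]:
    """Dictionary terms other than `word` within `max_distance` edits.
    The threshold is a hypothetical parameter, not taken from the paper."""
    return [t for t in dictionary
            if t != word and levenshtein(word, t) <= max_distance]


def augment(segmented: List[str], dictionary: List[str],
            max_distance: int = 1) -> List[List[str]]:
    """New segmented sentences, each made by replacing one in-dictionary
    word with a nearby dictionary term (an assumed replacement rule)."""
    known = set(dictionary)
    results = []
    for i, word in enumerate(segmented):
        if word not in known:
            continue
        for term in nearest_terms(word, dictionary, max_distance):
            results.append(segmented[:i] + [term] + segmented[i + 1:])
    return results


# Toy dictionary of rock names (illustrative only, not the 20,137-word one).
toy_dictionary = ["花岗岩", "花岗斑岩", "玄武岩", "石灰岩"]
sentence = ["该区", "出露", "花岗岩"]
print(augment(sentence, toy_dictionary))
# → [['该区', '出露', '花岗斑岩']]
```

Because each replacement is itself a dictionary entry, word boundaries in the generated sentence are known without manual annotation, which is what makes this kind of substitution usable for segmentation training data.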

Journal: Annals of GIS	Publication Date: Jan 12, 2023
Citations: 5	License type: open-access

