Performance Of Word Segmentation Research Articles

ABSTRACT For geoscience text, rich domain corpora are the basis for improving the performance of word segmentation models. However, the lack of annotated domain-specific corpora has become a major obstacle to professional information mining in geoscience. In this paper, we propose a corpus augmentation method based on Levenshtein distance. A geoscience dictionary of 20,137 words was constructed by crawling keywords from papers published in the China National Knowledge Infrastructure (CNKI). The dictionary was then used as the main source of synonyms to enrich the geoscience corpus according to the Levenshtein distance between words. Finally, a Chinese word segmentation model combining BERT, a bidirectional gated recurrent unit (Bi-GRU), and conditional random fields (CRF) was implemented. A geoscience corpus composed of long, complex domain-specific vocabulary was selected to test the proposed word segmentation framework. CNN-LSTM, Bi-LSTM-CRF, and Bi-GRU-CRF models were also used to evaluate the effect of the Levenshtein data augmentation technique. Experimental results show that the proposed method achieves a performance improvement of more than 10%. It has great potential for natural language processing tasks such as named entity recognition and relation extraction.
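The augmentation idea in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say how candidate terms are chosen, so the distance threshold `max_distance`, the helper names `nearest_terms` and `augment`, and the four-term toy dictionary are all assumptions made for illustration. The sketch computes the Levenshtein distance between terms and swaps a token for any dictionary entry within the threshold.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete from a
                               current[j - 1] + 1,       # insert into a
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def nearest_terms(word: str, dictionary: List[str], max_distance: int = 1) -> List[str]:
    """Hypothetical helper: dictionary entries within max_distance edits of word."""
    return [t for t in dictionary
            if t != word and levenshtein(word, t) <= max_distance]


def augment(tokens: List[str], dictionary: List[str], max_distance: int = 1) -> List[List[str]]:
    """Hypothetical helper: new sequences made by swapping one token for a close term."""
    variants = []
    for i, tok in enumerate(tokens):
        for cand in nearest_terms(tok, dictionary, max_distance):
            variants.append(tokens[:i] + [cand] + tokens[i + 1:])
    return variants


# Toy dictionary (assumed), standing in for the 20,137-word geoscience dictionary.
toy_dictionary = ["花岗岩", "花岗斑岩", "玄武岩", "闪长岩"]
segmented = ["花岗岩", "分布", "广泛"]
print(augment(segmented, toy_dictionary))
```

Because each variant replaces a whole token, the word boundaries of the original segmented sentence carry over to the new sentence, which is what would make such variants usable as extra labelled training data. A real pipeline would also need to check that the swapped term is semantically acceptable in context, which this sketch does not attempt.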

Language-specific features necessitate certain processes and skills in reading. Because between-word boundaries are visually unmarked in written Chinese, it is critical that readers be able to segment words in continuous text. This may pose challenges for second language (L2) readers whose first language (L1) is word-spaced. Given the limited understanding of Chinese L2 readers' word segmentation, the present study investigated 100 L2 learners' word segmentation performance, the relationships between word segmentation and reading fluency and comprehension, and the differences in these relationships among learners with different context-free word recognition abilities. Results demonstrated that L2 learners generally performed word segmentation well and that word segmentation contributed to reading fluency and comprehension beyond context-free word recognition. Word segmentation was the major predictor of reading fluency and comprehension among weak wordlist readers, whereas context-free word recognition made a larger contribution among learners who had achieved stronger context-free word recognition ability. The findings suggest the importance of developing word segmentation skills and establishing high-quality word representations in Chinese L2 reading instruction.

Articles published on Performance Of Word Segmentation

A Levenshtein distance-based method for word segmentation in corpus augmentation of geoscience texts

LATTE: Lattice ATTentive Encoding for Character-based Word Segmentation

The role of mother-infant emotional synchrony in speech processing in 9-month-old infants

Investigating word segmentation of Chinese second language learners

Thai Word Segmentation Based on Sequence-to-Sequence Model

Transliteration recognition of Tibetan person name based on Tibetan cultural knowledge

Does visual speech information affect word segmentation?

Speech segmentation by statistical learning depends on attention

Handwritten phrase recognition as applied to street name images

World isolation and meaning in segmentation
