Abstract

Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively correlated with the presence of word boundary markers and negatively correlated with the number of unique non-segmental terms. Based on this analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Universal Dependencies datasets. Our model obtains state-of-the-art accuracies on all the UD languages. Compared to previous work, it performs substantially better on languages that are non-trivial to segment, such as Chinese, Japanese, Arabic and Hebrew.

Highlights

  • Word segmentation is the initial step for most higher-level natural language processing tasks, such as part-of-speech (POS) tagging, parsing and machine translation

  • We propose a sequence tagging model and apply it to universal word segmentation

  • A bidirectional recurrent neural network with conditional random fields (BiRNN-CRF) is adopted as the fundamental segmentation framework, complemented by an attention-based sequence-to-sequence transducer for non-segmental multiword tokens (see the sketch after this list)
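
The following is a minimal, hypothetical sketch of the BiRNN emission component of such a character-level segmentation tagger, written in PyTorch. The layer sizes, the choice of a GRU, and the five-tag boundary scheme are illustrative assumptions rather than the paper's exact configuration; the CRF layer and the attention-based transducer mentioned above are omitted.

```python
# Minimal sketch (not the authors' implementation) of a BiRNN emission
# component for character-level segmentation tagging. Hyperparameters and
# the five-tag boundary scheme are illustrative assumptions; the CRF layer
# and the attention-based transducer are omitted.
import torch
import torch.nn as nn

class BiRNNSegmenter(nn.Module):
    def __init__(self, vocab_size, num_tags=5, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.birnn = nn.GRU(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.emit = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer-encoded characters
        h, _ = self.birnn(self.embed(char_ids))
        return self.emit(h)  # per-character emission scores over boundary tags

# Toy usage: score a batch of two 6-character sequences.
model = BiRNNSegmenter(vocab_size=100)
scores = model(torch.randint(0, 100, (2, 6)))
print(scores.shape)  # torch.Size([2, 6, 5])
```

In a full BiRNN-CRF, these emission scores would be combined with learned tag-transition scores, and Viterbi decoding would select the highest-scoring tag sequence for each input.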


Summary

Introduction

Word segmentation is the initial step for most higher-level natural language processing tasks, such as part-of-speech (POS) tagging, parsing and machine translation. It can be regarded as the problem of correctly identifying word forms from a character string. Word segmentation can be very challenging, especially for languages without explicit word boundary delimiters, such as Chinese, Japanese and Vietnamese. Even for languages with boundary markers, such as English or Russian, relying on white space alone generally does not result in adequate segmentation, as punctuation at least should usually be separated from the words it is attached to. In other languages, such as Arabic and Hebrew, the space-delimited units in the surface form are too coarse-grained and often require further analysis. Even though language-specific word segmentation systems are near-perfect for some languages, it is still useful to have a single system that performs reasonably well with no or minimal language-specific adaptations.
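
As an illustration of casting segmentation as character-level tagging, the short sketch below converts a list of word forms into per-character boundary tags and recovers the words from a tag sequence. The B/I/E/S tag names (begin, inside, end, single) are a common convention used here as an assumption, not necessarily the paper's exact tag set.

```python
# Illustrative sketch: word segmentation as per-character boundary tagging.
# B/I/E/S (begin/inside/end/single) is a common scheme, assumed here.
def words_to_tags(words):
    """Map a list of word forms to the character string and its boundary tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return "".join(words), tags

def tags_to_words(chars, tags):
    """Recover word forms from characters and predicted boundary tags."""
    words, buf = [], ""
    for c, t in zip(chars, tags):
        buf += c
        if t in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)
    return words

chars, tags = words_to_tags(["我", "喜欢", "自然语言"])
print(chars, tags)                 # 我喜欢自然语言 ['S', 'B', 'E', 'B', 'I', 'I', 'E']
print(tags_to_words(chars, tags))  # ['我', '喜欢', '自然语言']
```

A tagger that predicts these boundary labels for each character thus performs segmentation directly, regardless of whether the language marks word boundaries in its orthography.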

