Abstract

Word segmentation is a low-level NLP task that is non-trivial for a considerable number of languages. In this paper, we present a sequence tagging framework and apply it to word segmentation for a wide range of languages with different writing systems and typological characteristics. Additionally, we investigate the correlations between various typological factors and word segmentation accuracy. The experimental results indicate that segmentation accuracy is positively correlated with the presence of word boundary markers and negatively correlated with the number of unique non-segmental terms. Based on this analysis, we design a small set of language-specific settings and extensively evaluate the segmentation system on the Universal Dependencies datasets. Our model obtains state-of-the-art accuracies on all the UD languages. Compared to previous work, it performs substantially better on languages that are non-trivial to segment, such as Chinese, Japanese, Arabic and Hebrew.

Highlights

  • Word segmentation is the initial step for most higher-level natural language processing tasks, such as part-of-speech (POS) tagging, parsing and machine translation

  • We propose a sequence tagging model and apply it to universal word segmentation

  • A bidirectional recurrent neural network with conditional random fields (BiRNN-CRF) is adopted as the fundamental segmentation framework, complemented by an attention-based sequence-to-sequence transducer for non-segmental multiword tokens (see the sketch after this list)
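
The following is a minimal, hypothetical sketch of the BiRNN emission component of such a character-level segmentation tagger, written in PyTorch. The layer sizes, the choice of a GRU, and the five-tag boundary scheme are illustrative assumptions rather than the paper's exact configuration; the CRF layer and the attention-based transducer mentioned above are omitted.

```python
# Minimal sketch (not the authors' implementation) of a BiRNN emission
# component for character-level segmentation tagging. Hyperparameters and
# the five-tag boundary scheme are illustrative assumptions; the CRF layer
# and the attention-based transducer are omitted.
import torch
import torch.nn as nn

class BiRNNSegmenter(nn.Module):
    def __init__(self, vocab_size, num_tags=5, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.birnn = nn.GRU(emb_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.emit = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) integer-encoded characters
        h, _ = self.birnn(self.embed(char_ids))
        return self.emit(h)  # per-character emission scores over boundary tags

# Toy usage: score a batch of two 6-character sequences.
model = BiRNNSegmenter(vocab_size=100)
scores = model(torch.randint(0, 100, (2, 6)))
print(scores.shape)  # torch.Size([2, 6, 5])
```

In a full BiRNN-CRF, these emission scores would be combined with learned tag-transition scores, and Viterbi decoding would select the highest-scoring tag sequence for each input.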


Summary

Introduction

Word segmentation is the initial step for most higher-level natural language processing tasks, such as part-of-speech (POS) tagging, parsing and machine translation. It can be regarded as the problem of correctly identifying word forms from a character string. Word segmentation can be very challenging, especially for languages without explicit word boundary delimiters, such as Chinese, Japanese and Vietnamese. Even for languages with boundary markers, such as English or Russian, relying on white space alone generally does not result in adequate segmentation, as punctuation at least should usually be separated from the words it is attached to. In other languages, such as Arabic and Hebrew, the space-delimited units in the surface form are too coarse-grained and often require further analysis. Even though language-specific word segmentation systems are near-perfect for some languages, it is still useful to have a single system that performs reasonably well with no or minimal language-specific adaptations.
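
As an illustration of casting segmentation as character-level tagging, the short sketch below converts a list of word forms into per-character boundary tags and recovers the words from a tag sequence. The B/I/E/S tag names (begin, inside, end, single) are a common convention used here as an assumption, not necessarily the paper's exact tag set.

```python
# Illustrative sketch: word segmentation as per-character boundary tagging.
# B/I/E/S (begin/inside/end/single) is a common scheme, assumed here.
def words_to_tags(words):
    """Map a list of word forms to the character string and its boundary tags."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["I"] * (len(w) - 2) + ["E"])
    return "".join(words), tags

def tags_to_words(chars, tags):
    """Recover word forms from characters and predicted boundary tags."""
    words, buf = [], ""
    for c, t in zip(chars, tags):
        buf += c
        if t in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)
    return words

chars, tags = words_to_tags(["我", "喜欢", "自然语言"])
print(chars, tags)                 # 我喜欢自然语言 ['S', 'B', 'E', 'B', 'I', 'I', 'E']
print(tags_to_words(chars, tags))  # ['我', '喜欢', '自然语言']
```

A tagger that predicts these boundary labels for each character thus performs segmentation directly, regardless of whether the language marks word boundaries in its orthography.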

