Lemmatization for variation-rich languages using deep learning

Mike Kestemont, Guy De Pauw, Walter Daelemans, Renske Van Nie

doi:10.1093/llc/fqw034

Abstract

In this article, we describe a novel approach to sequence tagging for languages that are rich in (e.g. orthographic) surface variation. We focus on lemmatization, a basic step in many processing pipelines in the Digital Humanities. While this task has long been considered solved for modern languages such as English, there exist many (e.g. historical) languages for which the problem is harder to solve, due to a lack of resources and unstable orthography. Our approach is based on recent advances in the field of ‘deep’ representation learning, where neural networks have led to a dramatic increase in performance across several domains. The proposed system combines two components: first, we apply temporal convolutions to model the orthography of input words at the character level; second, we use distributional word embeddings to represent the lexical context surrounding the input words. We demonstrate how this system reaches state-of-the-art performance on a number of representative Middle Dutch data sets, even without corpus-specific parameter tuning.
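The two components named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the toy alphabet, filter count, filter width, embedding size, and the helper names `one_hot_chars`, `temporal_conv`, and `represent_token` are all assumptions made for illustration. The weights are random, and both training and the final classifier that maps the combined vector to a lemma are omitted.

```python
import numpy as np

# Assumed toy alphabet; a real system would derive it from the corpus.
ALPHABET = "abcdefghijklmnopqrstuvwxyz"
CHAR_INDEX = {c: i for i, c in enumerate(ALPHABET)}


def one_hot_chars(word, max_len=12):
    """Encode a word as a zero-padded (max_len, alphabet_size) one-hot matrix."""
    mat = np.zeros((max_len, len(ALPHABET)))
    for pos, ch in enumerate(word.lower()[:max_len]):
        if ch in CHAR_INDEX:
            mat[pos, CHAR_INDEX[ch]] = 1.0
    return mat


def temporal_conv(chars, filters):
    """Temporal (1D) convolution over character positions.

    chars:   (T, A) one-hot character matrix
    filters: (F, W, A) bank of F filters, each spanning W characters
    Returns an (F,) vector: ReLU activations, max-pooled over time.
    """
    n_filters, width, _ = filters.shape
    n_windows = chars.shape[0] - width + 1
    out = np.empty((n_windows, n_filters))
    for t in range(n_windows):
        window = chars[t:t + width]  # (W, A) slice of consecutive characters
        out[t] = np.tensordot(filters, window, axes=([1, 2], [0, 1]))
    return np.maximum(out, 0.0).max(axis=0)


def represent_token(focus, left, right, filters, embeddings):
    """Concatenate character-level features of the focus word with the
    embeddings of its left and right neighbours (zeros if unknown)."""
    dim = next(iter(embeddings.values())).shape[0]
    unknown = np.zeros(dim)
    return np.concatenate([
        temporal_conv(one_hot_chars(focus), filters),
        embeddings.get(left, unknown),
        embeddings.get(right, unknown),
    ])


rng = np.random.default_rng(0)
filters = rng.normal(size=(8, 3, len(ALPHABET)))  # 8 filters of width 3
# Hypothetical 4-dimensional embeddings for two illustrative tokens.
embeddings = {w: rng.normal(size=4) for w in ["die", "coninc"]}

vec = represent_token("seide", "coninc", "dat", filters, embeddings)
print(vec.shape)
```

Max-pooling over time is one common way to turn per-position filter responses into a fixed-length vector regardless of word length; whether the original system pools in this way is not stated in the abstract.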

Full Text