Abstract

For this year’s multilingual dependency parsing shared task, we developed a pipeline system that uses a variety of features for each of its components. Unlike the recently popular deep learning approaches, which learn low-dimensional dense features with non-linear classifiers, our system uses structured linear classifiers to learn millions of sparse features. Specifically, we trained a linear classifier for sentence boundary prediction and linear-chain conditional random fields (CRFs) for tokenization, part-of-speech tagging, and morphological analysis. A second-order graph-based parser learns the tree structure (without relations), and a linear tree CRF then assigns relations to the dependencies in the tree. Our system achieves reasonable performance: a 67.87% official macro-averaged F1 score.
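To make "millions of sparse features" concrete, the sketch below shows the usual string-indicator representation used by structured linear classifiers: each token position fires a handful of string-valued features, and the model learns one weight per distinct feature string. The feature templates here are illustrative assumptions, not the authors' actual templates.

```python
def token_features(tokens, i):
    """Sparse indicator features for the token at position i.

    Each key is a feature string; the learner assigns one weight per
    distinct string, so the feature space grows into the millions.
    Templates (word form, lowercased form, 3-char suffix, previous
    word) are hypothetical examples.
    """
    w = tokens[i]
    return {
        f"w={w}": 1.0,
        f"w.lower={w.lower()}": 1.0,
        f"suffix3={w[-3:]}": 1.0,
        f"prev={tokens[i - 1] if i > 0 else '<S>'}": 1.0,
    }

print(token_features(["The", "cats", "sleep"], 1))
# {'w=cats': 1.0, 'w.lower=cats': 1.0, 'suffix3=ats': 1.0, 'prev=The': 1.0}
```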

Highlights

  • The tokenizer, POS tagger, and morphological analyzer are based on linear-chain conditional random fields (CRFs) (Lafferty et al., 2001), and the relation predictor is based on a linear tree CRF

  • We train the pipeline for each language independently using the training portion of the treebank and the official word embeddings for 45 languages provided by the organizers

  • A linear-chain CRF is used to learn the model with character and word n-gram features
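The character n-gram features mentioned above can be extracted as in the following minimal sketch; the window and maximum n-gram length are assumptions, not values reported by the authors.

```python
def ngram_features(chars, i, max_n=3):
    """Character n-gram indicator features covering position i.

    For each n up to max_n, emit every n-gram that includes chars[i],
    keyed by its offset relative to i. These strings would be fed to a
    linear-chain CRF as sparse features (template sizes are assumed).
    """
    feats = []
    for n in range(1, max_n + 1):
        for start in range(i - n + 1, i + 1):
            if start >= 0 and start + n <= len(chars):
                gram = "".join(chars[start:start + n])
                feats.append(f"{n}gram[{start - i}]={gram}")
    return feats

print(ngram_features(list("cats"), 1))
# ['1gram[0]=a', '2gram[-1]=ca', '2gram[0]=at', '3gram[-1]=cat', '3gram[0]=ats']
```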


Summary

Introduction

Our system for the universal dependency parsing shared task at CoNLL 2017 (Zeman et al., 2017) follows a typical pipeline framework. The system architecture is shown in Figure 1 and consists of the following components: (1) a sentence segmenter, which segments raw text into sentences; (2) a tokenizer, which tokenizes sentences into words, or performs word segmentation for Asian languages; (3) a morphological analyzer, which generates morphological features; (4) a part-of-speech (POS) tagger, which generates universal POS tags and language-specific POS tags; (5) a parser, which predicts tree structures without relations; and (6) a relation predictor, which assigns relations to the dependencies in the tree. For each component, we take a non-deep-learning approach: typical structured linear classifiers that learn sparse features but require heavy feature engineering. Tokenization, POS tagging, and morphological analysis are based on linear-chain CRFs (Lafferty et al., 2001), and the relation predictor is based on a linear tree CRF.
We train the pipeline for each language independently, using the training portion of each treebank and the official word embeddings for 45 languages provided by the organizers. Due to time constraints, we did not optimize our system for speed or memory.
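The six-stage data flow described above can be sketched as follows. Each stage here is a trivial stand-in (the real components are the structured linear classifiers, CRFs, and graph-based parser described in the text); the sketch only illustrates how data moves through the pipeline, and all function names are hypothetical.

```python
def segment_sentences(text):
    # Stand-in segmenter: the real system uses a linear classifier.
    return [s.strip() for s in text.split(".") if s.strip()]

def tokenize(sentence):
    # Stand-in tokenizer: the real system uses a linear-chain CRF.
    return sentence.split()

def tag_pos(tokens):
    # Stand-in tagger: the real system uses a linear-chain CRF.
    return [(t, "NOUN") for t in tokens]

def parse(tagged):
    # Stand-in parser attaching each word to the previous one (root = -1);
    # the real system uses a second-order graph-based parser.
    return [(i, i - 1 if i > 0 else -1) for i, _ in enumerate(tagged)]

def label_relations(arcs):
    # Stand-in labeler: the real system uses a linear tree CRF.
    return [(dep, head, "dep") for dep, head in arcs]

def run_pipeline(text):
    """Run raw text through the full pipeline, one sentence at a time."""
    out = []
    for sent in segment_sentences(text):
        tokens = tokenize(sent)
        tagged = tag_pos(tokens)
        arcs = parse(tagged)
        out.append(label_relations(arcs))
    return out

print(run_pipeline("Dogs bark."))
# [[(0, -1, 'dep'), (1, 0, 'dep')]]
```

Because every stage consumes the previous stage's output, errors propagate downstream; that is the usual trade-off of a pipeline architecture over joint models.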

