Abstract

Many tasks in natural language processing involve predicting structured outputs, e.g., sequence labeling, semantic role labeling, parsing, and machine translation. Researchers are increasingly applying deep representation learning to these problems, but the structured component of these approaches is usually quite simplistic. In this work, we propose several high-order energy terms to capture complex dependencies among labels in sequence labeling, including several that consider the entire label sequence. We use neural parameterizations for these energy terms, drawing from convolutional, recurrent, and self-attention networks. We use the framework of learning energy-based inference networks (Tu and Gimpel, 2018) for dealing with the difficulties of training and inference with such models. We empirically demonstrate that this approach achieves substantial improvement using a variety of high-order energy terms on four sequence labeling tasks, while having the same decoding speed as simple, local classifiers. We also find high-order energies to help in noisy data conditions.
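To make the setup concrete, below is a minimal, hedged sketch (in PyTorch, not the authors' released code) of how an energy over a relaxed label sequence might combine a local unary term with one high-order term parameterized by a convolution over label trigrams. The class names, dimensions, and the max-pooling choice are illustrative assumptions.

```python
import torch
import torch.nn as nn

class UnaryEnergy(nn.Module):
    """Local term: per-token label scores dotted with the (relaxed) labels."""
    def __init__(self, hidden, num_labels):
        super().__init__()
        self.proj = nn.Linear(hidden, num_labels)

    def forward(self, feats, y):
        # feats: (batch, seq, hidden); y: (batch, seq, num_labels), e.g. softmax outputs
        return (self.proj(feats) * y).sum(dim=(1, 2))

class TrigramConvEnergy(nn.Module):
    """High-order term: a 1-D convolution with window 3 over the label sequence."""
    def __init__(self, num_labels, filters=32):
        super().__init__()
        self.conv = nn.Conv1d(num_labels, filters, kernel_size=3, padding=1)
        self.score = nn.Linear(filters, 1)

    def forward(self, y):
        h = torch.relu(self.conv(y.transpose(1, 2)))         # (batch, filters, seq)
        return self.score(h.max(dim=2).values).squeeze(-1)   # (batch,)

def energy(feats, y, unary, high_order):
    # Lower energy = better (input, label-sequence) pair; an inference network is
    # trained to output y that approximately minimizes this quantity.
    return -(unary(feats, y) + high_order(y))
```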

Highlights

  • Conditional random fields (CRFs; Lafferty et al., 2001) have been shown to perform well in various sequence labeling tasks

  • While the optimal energy function varies by task, we find strong performance from skip-chain terms with short skip distances, convolutional networks with filters that consider label trigrams, and recurrent networks and self-attention networks that consider large subsequences of labels (see the sketch after this list)

  • Here we find that the framework of SPEN learning with inference networks can support a wide range of high-order energies for sequence labeling
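As referenced in the second highlight, the following is a speculative sketch of a skip-chain energy term: pairwise potentials between labels a fixed skip distance apart, applied to the relaxed label vectors produced by an inference network. The bilinear parameterization via a single matrix `W` is an assumption made for illustration.

```python
import torch
import torch.nn as nn

class SkipChainEnergy(nn.Module):
    """Pairwise potentials between labels `skip` positions apart."""
    def __init__(self, num_labels, skip=2):
        super().__init__()
        self.skip = skip
        self.W = nn.Parameter(torch.zeros(num_labels, num_labels))

    def forward(self, y):
        # y: (batch, seq, num_labels) relaxed label vectors
        a, b = y[:, :-self.skip, :], y[:, self.skip:, :]
        # sum over positions t of  y_t^T W y_{t+skip}
        return torch.einsum('bti,ij,btj->b', a, self.W, b)
```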


Summary

Introduction

Conditional random fields (CRFs; Lafferty et al., 2001) have been shown to perform well in various sequence labeling tasks. A major challenge with CRFs is the complexity of training and inference, which is quadratic in the number of output labels for first-order models and grows exponentially when higher-order dependencies are considered. This explains why the most common type of CRF used in practice is a first-order model, referred to as a "linear-chain" CRF. We instead train inference networks to approximate energy minimization, which makes richer, high-order energy terms practical. Enlarging the inference network architecture by adding one layer consistently improves results, rivaling or surpassing a BiLSTM-CRF baseline, which suggests that training efficient inference networks with high-order energy terms can compensate for errors arising from approximate inference. While we focus on sequence labeling in this paper, our results show the potential of developing high-order structured models for other NLP tasks in the future.
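To illustrate the decoding-speed point, here is a hedged sketch of an inference network at test time: a BiLSTM tagger (an assumed architecture) whose output is simply an argmax per position, so decoding costs the same as a simple local classifier regardless of how expressive the training-time energy is.

```python
import torch
import torch.nn as nn

class InferenceNetwork(nn.Module):
    """A BiLSTM tagger producing a label distribution at every position."""
    def __init__(self, vocab_size, emb=100, hidden=128, num_labels=10):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb)
        self.rnn = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden, num_labels)

    def forward(self, tokens):
        h, _ = self.rnn(self.emb(tokens))
        return torch.softmax(self.out(h), dim=-1)  # relaxed labels fed to the energy

@torch.no_grad()
def decode(model, tokens):
    # No Viterbi pass: an independent argmax per position, so test-time cost is
    # linear in sequence length and in the number of labels.
    return model(tokens).argmax(dim=-1)
```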

Structured Energy-Based Learning
Inference Networks
An Objective for Joint Learning of Inference Networks
Energy Functions
Linear Chain Energies
Skip-Chain Energies
High-Order Energies
Fully-Connected Energies
Related Work
Datasets
Training
Results
Results on Noisy Datasets
Incorporating BERT
Analysis of Learned Energies
Conclusion