Abstract

Natural language processing comprises techniques for processing language data, such as text and speech. Sequence labeling is an upstream task for many natural language processing applications, such as machine translation, text classification, and sentiment classification. This paper focuses on the sequence labeling task, in which semantic labels are assigned to each unit of a given input sequence. Two frameworks of latent variable conditional random fields (CRF) models (called LVCRF-I and LVCRF-II) are proposed, which use the encoding schema as a latent variable to capture the latent structure between the hidden variables and the observed data. Of the two designed models, LVCRF-I operates at the sentence level while LVCRF-II operates at the word level, automatically choosing the best encoding schema for a given input sequence without handcrafted features. In the experiments, the two proposed models are verified on four sequence prediction tasks: named entity recognition (NER), chunking, reference parsing, and POS tagging. The proposed frameworks achieve better performance than the conventional CRF model without using additional handcrafted features. Moreover, these frameworks can serve as drop-in substitutes for conventional CRF models: in the commonly used LSTM-CRF models, the CRF layer can be replaced with our proposed framework, as they share the same training and inference procedures. The experimental results show that the proposed models exploit the latent encoding-schema variable and provide competitive and robust performance on all four sequence prediction tasks.

Highlights

  • Sequence labeling is often the first step in text data processing

  • To explain the models clearly, we briefly introduce the conventional conditional random fields (CRF) model, present the proposed latent variable CRF models, and highlight the main differences between them

  • We propose a framework of the CRF that uses the encoding schema as a latent variable


Summary

Introduction

Sequence labeling is the task of identifying and assigning a semantic label to each unit or subsequence of an input sequence. For example, a single-word person name is marked with B in the BIO encoding schema, because B represents the beginning of a person entity, but it is marked with U in the BILOU encoding schema, wherein U denotes a unit-length person entity. Different encoding schemas can lead to different performance across models and sequence-labeling tasks. In this paper, two latent variable CRFs are proposed that can automatically choose the best encoding schema for a given input sentence. The two proposed models incorporate the encoding schemas as a latent variable in the conventional CRF in two different ways. The performance of the proposed latent variable models is much better than that of the conventional CRF with the BIO or BILOU encoding schema.
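The difference between the two encoding schemas can be illustrated with a short sketch (not from the paper; the tokens, entity spans, and helper functions below are hypothetical examples): the same entity spans are rendered as tag sequences under BIO and under BILOU, and a single-token entity receives B in the former but U in the latter.

```python
# Hypothetical illustration of BIO vs BILOU encoding of the same entity spans.

def encode_bio(tokens, spans):
    """spans: list of (start, end_exclusive, type) entity spans."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = f"B-{etype}"          # B marks the beginning
        for i in range(start + 1, end):
            tags[i] = f"I-{etype}"          # I marks the inside
    return tags

def encode_bilou(tokens, spans):
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"U-{etype}"      # U marks a unit-length entity
        else:
            tags[start] = f"B-{etype}"
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"
            tags[end - 1] = f"L-{etype}"    # L marks the last token
    return tags

tokens = ["John", "visited", "New", "York"]
spans = [(0, 1, "PER"), (2, 4, "LOC")]
print(encode_bio(tokens, spans))    # ['B-PER', 'O', 'B-LOC', 'I-LOC']
print(encode_bilou(tokens, spans))  # ['U-PER', 'O', 'B-LOC', 'L-LOC']
```

Note how the single-token entity "John" is tagged B-PER under BIO but U-PER under BILOU, which is exactly the distinction the latent variable in the proposed models is meant to arbitrate.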

Literature review
Latent variable CRF
Encoding Schema
Problem Statement
Proposed latent variable CRF models
Conventional CRF
p(y | x) = exp{W · f(x, y)} / Z(x)
Latent Variable CRF-I
Latent variable CRF-II
Features
Experimental evaluation
Datasets
Named entity recognition
Reference parsing
Chunking
POS tagging
Conclusion
Declaration of Competing Interest