Abstract

We present the first multi-task learning model – named PhoNLP – for joint Vietnamese part-of-speech (POS) tagging, named entity recognition (NER) and dependency parsing. Experiments on Vietnamese benchmark datasets show that PhoNLP produces state-of-the-art results, outperforming a single-task learning approach that fine-tunes the pre-trained Vietnamese language model PhoBERT (Nguyen and Nguyen, 2020) for each task independently. We publicly release PhoNLP as an open-source toolkit under the Apache License 2.0. Although we specify PhoNLP for Vietnamese, our PhoNLP training and evaluation command scripts can in fact directly work for other languages that have a pre-trained BERT-based language model and gold annotated corpora available for the three tasks of POS tagging, NER and dependency parsing. We hope that PhoNLP can serve as a strong baseline and useful toolkit for future NLP research and applications, not only for Vietnamese but also for other languages. Our PhoNLP is available at https://github.com/VinAIResearch/PhoNLP.

Highlights

  • Multi-task learning is a promising solution as it might help reduce the storage space

  • Given an input sentence of words to PhoNLP, an encoding layer generates contextualized word embeddings that represent the input words. These contextualized word embeddings are fed into a POS tagging layer that is a linear prediction layer (Devlin et al, 2019) to predict POS tags for the input words

  • Our PhoNLP can be viewed as an extension of previous joint POS tagging and dependency parsing models (Hashimoto et al, 2017; Li et al, 2018; Nguyen and Verspoor, 2018; Nguyen, 2019; Kondratyuk and Straka, 2019), where we incorporate a CRF-based prediction layer for NER.
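The POS tagging layer described in the highlights above, a linear prediction layer applied to contextualized word embeddings, can be sketched as follows. This is an illustrative NumPy sketch rather than the actual PhoNLP code, and all dimensions are hypothetical:

```python
import numpy as np

# Illustrative sketch of a linear POS prediction layer over
# contextualized word embeddings (not the actual PhoNLP code).

def linear_pos_layer(embeddings, weight, bias):
    """Score every POS tag for every word: logits = X @ W.T + b."""
    return embeddings @ weight.T + bias

# Hypothetical example: 4 words, 768-dim embeddings, 20 POS tags.
rng = np.random.default_rng(0)
X = rng.standard_normal((4, 768))   # contextualized word embeddings
W = rng.standard_normal((20, 768))  # tag scoring weights
b = np.zeros(20)
logits = linear_pos_layer(X, W, b)  # shape (4, 20)
pred_tags = logits.argmax(axis=-1)  # one predicted tag id per word
```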


Summary

Introduction

Vietnamese NLP research has been significantly explored recently. It has been boosted by the success of the national project on Vietnamese language and speech processing (VLSP) KC01.01/2006-2010 and the VLSP workshops that have run shared tasks since 2013. Fundamental tasks of POS tagging, NER and dependency parsing play important roles, providing useful features for many downstream application tasks such as machine translation (Tran et al, 2016), sentiment analysis (Bang and Sornlertlamvanich, 2018), relation extraction (To and Do, 2020), semantic parsing (Nguyen et al, 2020), open information extraction (Truong et al, 2017) and question answering.

Model description

Based on both the contextualized word embeddings and the "soft" POS tag embeddings, the NER layer uses a linear-chain CRF predictor (Lafferty et al, 2001) to predict NER labels for the input words, while the dependency parsing layer uses a Biaffine classifier (Dozat and Manning, 2017) to predict dependency arcs between the words and another Biaffine classifier to predict dependency labels. Following Hashimoto et al (2017), the "soft" POS tag embedding t_i^(1) is computed by multiplying a label weight matrix W^(1) with the corresponding probability vector p_i. During training, an objective loss L_DEP is computed by summing a cross-entropy loss for unlabeled dependency parsing and another cross-entropy loss for dependency label prediction, based on gold arcs and arc labels. The final training objective loss of PhoNLP is the weighted sum of the POS tagging loss L_POS, the NER loss L_NER and the dependency parsing loss L_DEP.
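The pieces of the model description above, the "soft" POS tag embedding t_i^(1) = W^(1) p_i, the Biaffine arc scorer, and the weighted sum of the three task losses, can be sketched in NumPy as follows. This is an illustrative sketch under assumed dimensions and placeholder loss weights, not the actual PhoNLP implementation:

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def soft_pos_embedding(pos_logits, label_weight):
    """Compute the "soft" POS tag embedding (Hashimoto et al., 2017):
    t_i = W @ p_i, where p_i is the POS probability vector of word i.
    pos_logits: (seq_len, num_tags); label_weight: (embed_dim, num_tags)."""
    p = softmax(pos_logits)
    return p @ label_weight.T              # (seq_len, embed_dim)

def biaffine_arc_scores(dep, head, U, u):
    """Biaffine arc scorer (Dozat and Manning, 2017):
    score[i, j] = dep_i^T U head_j + head_j^T u, the score of word j
    being the head of word i. dep, head: (seq_len, dim)."""
    return dep @ U @ head.T + head @ u     # (seq_len, seq_len)

def joint_loss(l_pos, l_ner, l_dep, w_pos=1.0, w_ner=1.0, w_dep=1.0):
    """Final objective: weighted sum of the three task losses.
    The weight values here are hypothetical placeholders."""
    return w_pos * l_pos + w_ner * l_ner + w_dep * l_dep
```

In the paper's notation, `label_weight` plays the role of W^(1), and L_DEP itself would be the sum of the unlabeled-arc and arc-label cross-entropy losses before entering `joint_loss`.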

Discussion
Dependency parsing
Implementation
Experiments
Findings
PhoNLP toolkit