Abstract

We study a family of data augmentation methods, substructure substitution (SUB2), for natural language processing (NLP) tasks. SUB2 generates new examples by replacing substructures (e.g., subtrees or subsequences) with others that have the same label, and can be applied to many structured NLP tasks such as part-of-speech tagging and parsing. For more general tasks (e.g., text classification) that do not have explicitly annotated substructures, we present variations of SUB2 based on constituency parse trees, introducing structure-aware data augmentation methods to general NLP tasks. In most cases, training on the dataset augmented by SUB2 achieves better performance than training on the original training set. Further experiments show that SUB2 performs more consistently than the other augmentation methods we investigate, across different tasks and sizes of the seed dataset.

Highlights

  • We study a family of general data augmentation methods, substructure substitution (SUB2), which generates new examples by substituting same-label substructures (Figure 1); a minimal sketch follows this list

  • For more general natural language processing (NLP) tasks such as text classification, we present variations of SUB2 that (1) define substructures based on text spans or parse trees of existing examples, and (2) generate new examples by substituting these substructures under various kinds of constraints

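The following minimal sketch illustrates the core SUB2 recipe on part-of-speech tagging, where the simplest substructure is a single (word, tag) pair and a word may only be replaced by another word observed with the same tag. The function names (build_tag_index, sub2_augment) and the single-token granularity are illustrative assumptions, not the paper's exact implementation.

    import random

    # Illustrative sketch (not the paper's exact implementation): treat each
    # (word, tag) pair as the smallest substructure and substitute a word
    # with another word observed under the same POS tag.

    def build_tag_index(tagged_sentences):
        """Map each POS tag to every word observed with that tag."""
        index = {}
        for sentence in tagged_sentences:
            for word, tag in sentence:
                index.setdefault(tag, []).append(word)
        return index

    def sub2_augment(sentence, tag_index, rng=random):
        """Create one new example by a single same-label substitution."""
        position = rng.randrange(len(sentence))
        word, tag = sentence[position]
        candidates = [w for w in tag_index.get(tag, []) if w != word]
        if not candidates:
            return list(sentence)  # no same-label substitute available
        augmented = list(sentence)
        augmented[position] = (rng.choice(candidates), tag)
        return augmented

    seed = [
        [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
        [("a", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    ]
    index = build_tag_index(seed)
    print(sub2_augment(seed[0], index))
    # e.g. [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ')]

Applying sub2_augment repeatedly to each seed sentence yields an augmented training set whose size is controlled by the number of substitutions drawn.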


Summary

Introduction

Data augmentation has been found effective for various natural language processing (NLP) tasks, such as machine translation (Fadaee et al., 2017; Gao et al., 2019; Xia et al., 2019, inter alia), text classification (Wei and Zou, 2019; Quteineh et al., 2020), syntactic and semantic parsing (Jia and Liang, 2016; Shi et al., 2020; Dehouck and Gómez-Rodríguez, 2020), semantic role labeling (Fürstenau and Lapata, 2009), and dialogue understanding (Hou et al., 2018; Niu and Bansal, 2019). Such methods enhance the diversity of the training set by generating new examples based on existing ones, and can make the learned models more robust against noise (Xie et al., 2020). To apply SUB2 to text classification, we use text spans as substructures, with both the number of words in a span and the text classification label as constraints (see Sec. 3.2).
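As a concrete illustration of this text-classification variation, the sketch below replaces a fixed-length span with another span of the same length drawn from a training example carrying the same class label, so the label is preserved by construction. The names collect_spans and sub2_text_classification, and the fixed span length, are hypothetical simplifications of the constraints described in Sec. 3.2.

    import random

    # Hypothetical sketch of the span-based variation: a span may only be
    # replaced by a span of the same length taken from an example with the
    # same class label, so the augmented example keeps its original label.

    def collect_spans(dataset, span_length):
        """Gather all spans of `span_length` words, grouped by class label."""
        spans_by_label = {}
        for words, label in dataset:
            for start in range(len(words) - span_length + 1):
                span = tuple(words[start:start + span_length])
                spans_by_label.setdefault(label, []).append(span)
        return spans_by_label

    def sub2_text_classification(words, label, spans_by_label, span_length,
                                 rng=random):
        """Produce one augmented example; the label is preserved."""
        if len(words) < span_length:
            return list(words), label
        start = rng.randrange(len(words) - span_length + 1)
        replacement = rng.choice(spans_by_label[label])
        augmented = words[:start] + list(replacement) + words[start + span_length:]
        return augmented, label

    dataset = [
        (["a", "truly", "wonderful", "film"], "positive"),
        (["an", "utterly", "delightful", "story"], "positive"),
    ]
    spans = collect_spans(dataset, span_length=2)
    print(sub2_text_classification(*dataset[0], spans, span_length=2))
    # e.g. (['a', 'truly', 'delightful', 'story'], 'positive')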

  • Related Work
  • Method
      • Variations of SUB2 for Text Classification
  • Experiments
      • Baselines
      • Part-of-Speech Tagging
      • Dependency Parsing
      • Constituency Parsing
      • Text Classification
  • Conclusion and Future Work