Abstract

We study a family of data augmentation methods, substructure substitution (SUB2), for natural language processing (NLP) tasks. SUB2 generates new examples by replacing substructures (e.g., subtrees or subsequences) with others that have the same label, and can be applied to many structured NLP tasks such as part-of-speech tagging and parsing. For more general tasks (e.g., text classification) that do not have explicitly annotated substructures, we present variations of SUB2 based on constituency parse trees, introducing structure-aware data augmentation methods to general NLP tasks. In most cases, training on the dataset augmented by SUB2 achieves better performance than training on the original training set. Further experiments show that SUB2 performs more consistently than the other augmentation methods we investigate, across different tasks and sizes of the seed dataset.

Highlights

  • We study a family of general data augmentation methods, substructure substitution (SUB2), which generates new examples by substituting same-label substructures (Figure 1); a minimal sketch follows this list

  • For more general natural language processing (NLP) tasks such as text classification, we present variations of SUB2 that (1) define substructures based on text spans or parse trees of existing examples, and (2) generate new examples by substituting these substructures under various kinds of constraints

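The following minimal sketch illustrates the core SUB2 recipe on part-of-speech tagging, where the simplest substructure is a single (word, tag) pair and a word may only be replaced by another word observed with the same tag. The function names (build_tag_index, sub2_augment) and the single-token granularity are illustrative assumptions, not the paper's exact implementation.

    import random

    # Illustrative sketch (not the paper's exact implementation): treat each
    # (word, tag) pair as the smallest substructure and substitute a word
    # with another word observed under the same POS tag.

    def build_tag_index(tagged_sentences):
        """Map each POS tag to every word observed with that tag."""
        index = {}
        for sentence in tagged_sentences:
            for word, tag in sentence:
                index.setdefault(tag, []).append(word)
        return index

    def sub2_augment(sentence, tag_index, rng=random):
        """Create one new example by a single same-label substitution."""
        position = rng.randrange(len(sentence))
        word, tag = sentence[position]
        candidates = [w for w in tag_index.get(tag, []) if w != word]
        if not candidates:
            return list(sentence)  # no same-label substitute available
        augmented = list(sentence)
        augmented[position] = (rng.choice(candidates), tag)
        return augmented

    seed = [
        [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
        [("a", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    ]
    index = build_tag_index(seed)
    print(sub2_augment(seed[0], index))
    # e.g. [('the', 'DT'), ('dog', 'NN'), ('sleeps', 'VBZ')]

Applying sub2_augment repeatedly to each seed sentence yields an augmented training set whose size is controlled by the number of substitutions drawn.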


Summary

Introduction

Data augmentation has been found effective for various natural language processing (NLP) tasks, such as machine translation (Fadaee et al., 2017; Gao et al., 2019; Xia et al., 2019, inter alia), text classification (Wei and Zou, 2019; Quteineh et al., 2020), syntactic and semantic parsing (Jia and Liang, 2016; Shi et al., 2020; Dehouck and Gómez-Rodríguez, 2020), semantic role labeling (Fürstenau and Lapata, 2009), and dialogue understanding (Hou et al., 2018; Niu and Bansal, 2019). Such methods enhance the diversity of the training set by generating new examples based on existing ones, and can make the learned models more robust against noise (Xie et al., 2020). To apply SUB2 to text classification, we use text spans as substructures, with both the number of words in a span and the text classification label as constraints (see Sec. 3.2).
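As a concrete illustration of this text-classification variation, the sketch below replaces a fixed-length span with another span of the same length drawn from a training example carrying the same class label, so the label is preserved by construction. The names collect_spans and sub2_text_classification, and the fixed span length, are hypothetical simplifications of the constraints described in Sec. 3.2.

    import random

    # Hypothetical sketch of the span-based variation: a span may only be
    # replaced by a span of the same length taken from an example with the
    # same class label, so the augmented example keeps its original label.

    def collect_spans(dataset, span_length):
        """Gather all spans of `span_length` words, grouped by class label."""
        spans_by_label = {}
        for words, label in dataset:
            for start in range(len(words) - span_length + 1):
                span = tuple(words[start:start + span_length])
                spans_by_label.setdefault(label, []).append(span)
        return spans_by_label

    def sub2_text_classification(words, label, spans_by_label, span_length,
                                 rng=random):
        """Produce one augmented example; the label is preserved."""
        if len(words) < span_length:
            return list(words), label
        start = rng.randrange(len(words) - span_length + 1)
        replacement = rng.choice(spans_by_label[label])
        augmented = words[:start] + list(replacement) + words[start + span_length:]
        return augmented, label

    dataset = [
        (["a", "truly", "wonderful", "film"], "positive"),
        (["an", "utterly", "delightful", "story"], "positive"),
    ]
    spans = collect_spans(dataset, span_length=2)
    print(sub2_text_classification(*dataset[0], spans, span_length=2))
    # e.g. (['a', 'truly', 'delightful', 'story'], 'positive')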

  • Related Work
  • Method
      • Variations of SUB2 for Text Classification
  • Experiments
      • Baselines
      • Part-of-Speech Tagging
      • Dependency Parsing
      • Constituency Parsing
      • Text Classification
  • Conclusion and Future Work