Abstract
The success of a text simplification system heavily depends on the quality and quantity of complex-simple sentence pairs in the training corpus, which are extracted by aligning sentences between parallel articles. To evaluate and improve sentence alignment quality, we create two manually annotated sentence-aligned datasets from two commonly used text simplification corpora, Newsela and Wikipedia. We propose a novel neural CRF alignment model which not only leverages the sequential nature of sentences in parallel documents but also utilizes a neural sentence pair model to capture semantic similarity. Experiments demonstrate that our proposed approach outperforms all previous work on the monolingual sentence alignment task by more than 5 points in F1. We apply our CRF aligner to construct two new text simplification datasets, Newsela-Auto and Wiki-Auto, which are much larger and of higher quality than the existing datasets. A Transformer-based seq2seq model trained on our datasets establishes a new state of the art for text simplification in both automatic and human evaluation.
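To make the alignment idea concrete, the following is a minimal, hypothetical sketch (not the authors' implementation): treat the complex sentence aligned to each simple sentence as one label in a sequence, use a semantic-similarity score from a sentence-pair model as the emission score, add a transition score that rewards in-order alignments, and decode the best sequence with Viterbi. The function name viterbi_align, the shift_penalty parameter, and the toy similarity matrix are illustrative assumptions only.

```python
# Illustrative sketch of CRF-style sentence alignment: emission scores come from
# a semantic-similarity matrix, transitions favor monotonic (in-order) alignments,
# and Viterbi decoding recovers the best alignment path.
import numpy as np

def viterbi_align(sim, shift_penalty=0.1):
    """sim[i, j]: similarity of simple sentence i to complex sentence j.
    Returns the index of the aligned complex sentence for each simple sentence."""
    n_simple, n_complex = sim.shape
    score = np.full((n_simple, n_complex), -np.inf)
    back = np.zeros((n_simple, n_complex), dtype=int)
    score[0] = sim[0]
    for i in range(1, n_simple):
        for j in range(n_complex):
            # transition score: penalize jumps away from the previous position
            # (a hypothetical choice standing in for a learned transition model)
            trans = np.array([-shift_penalty * abs(j - k) for k in range(n_complex)])
            prev = score[i - 1] + trans
            back[i, j] = int(np.argmax(prev))
            score[i, j] = prev[back[i, j]] + sim[i, j]
    # backtrack the highest-scoring alignment path
    path = [int(np.argmax(score[-1]))]
    for i in range(n_simple - 1, 0, -1):
        path.append(int(back[i, path[-1]]))
    return path[::-1]

# toy similarity matrix; in practice these scores would come from a neural sentence-pair model
sim = np.array([[0.9, 0.2, 0.1],
                [0.3, 0.8, 0.2],
                [0.1, 0.3, 0.7]])
print(viterbi_align(sim))  # -> [0, 1, 2]
```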
Highlights
Text simplification aims to rewrite complex text into simpler language while retaining its original meaning (Saggion, 2017).
Text simplification can improve the performance of many natural language processing (NLP) tasks, such as parsing (Chandrasekar et al., 1996), semantic role labelling (Vickrey and Koller, 2008), information extraction (Miwa et al., 2010), summarization (Vanderwende et al., 2007; Xu and Grishman, 2009), and machine translation (Chen et al., 2012; Stajner and Popovic, 2016).
Automatic text simplification is primarily addressed by sequence-to-sequence models whose success largely depends on the quality and quantity of the training corpus, which consists of pairs of complex-simple sentences.
Summary
Text simplification aims to rewrite complex text into simpler language while retaining its original meaning (Saggion, 2017). Automatic text simplification is primarily addressed by sequence-to-sequence (seq2seq) models whose success largely depends on the quality and quantity of the training corpus, which consists of pairs of complex-simple sentences. Two widely used corpora, NEWSELA (Xu et al., 2015) and WIKILARGE (Zhang and Lapata, 2017), were created by automatically aligning sentences between comparable articles. A common drawback of text simplification models trained on such datasets is that they behave conservatively, performing mostly deletion and rarely paraphrasing (Alva-Manchego et al., 2017). WIKILARGE is the concatenation of three early datasets (Zhu et al., 2010; Woodsend and Lapata, 2011; Coster and Kauchak, 2011) that are extracted from Wikipedia dumps and are known to contain many errors (Xu et al., 2015).