Abstract

In English, high-quality sentence compression models that compress by deleting words have been trained on large, automatically created training datasets. We address Japanese sentence compression with a similar approach. To create a large Japanese training dataset, we modify a method for building an English training dataset according to the characteristics of the Japanese language. The resulting dataset is used to train Japanese sentence compression models based on recurrent neural networks.

Highlights

  • Sentence compression is the task of shortening a sentence while preserving its important information and grammaticality

  • A high-quality English sentence compression model that deletes words was trained on a large training dataset (Filippova and Altun, 2013; Filippova et al., 2015)

  • The first model is Filippova et al.’s original model, an encoder-decoder model with long short-term memory (LSTM), which we extend in this paper to obtain two further models that can control the output length (Kikuchi et al., 2016), because controlling the output length makes the compressed sentence more informative under a desired length, as sketched below
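
The following is a minimal sketch, not the authors' code, of the general idea of length control in an LSTM-based model in the spirit of Kikuchi et al. (2016), adapted here to deletion-based compression: the remaining "keep budget" is embedded and fed to the LSTM at each step so the output length can be steered. All class, parameter, and variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

KEEP, DELETE = 1, 0

class LengthControlledTagger(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, len_dim=16, hidden_dim=256, max_budget=100):
        super().__init__()
        self.word_emb = nn.Embedding(vocab_size, emb_dim)
        self.len_emb = nn.Embedding(max_budget + 1, len_dim)   # embedding of remaining keep budget
        self.cell = nn.LSTMCell(emb_dim + len_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, 2)                    # KEEP / DELETE logits

    def step(self, word_id, budget, state):
        # Concatenate the word embedding with the embedding of the remaining budget.
        x = torch.cat([self.word_emb(word_id), self.len_emb(budget)], dim=-1)
        h, c = self.cell(x, state)
        return self.out(h), (h, c)

# Usage: walk over the source words, decrementing the budget whenever a word is kept.
model = LengthControlledTagger(vocab_size=5000)
state = (torch.zeros(1, 256), torch.zeros(1, 256))
budget = 10                            # desired compressed length in words
source_word_ids = [17, 42, 7, 99, 3]   # toy ids for a 5-word sentence
kept = []
for wid in source_word_ids:
    logits, state = model.step(torch.tensor([wid]), torch.tensor([budget]), state)
    if budget > 0 and logits.argmax(dim=-1).item() == KEEP:
        kept.append(wid)
        budget -= 1
```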


Summary

Introduction

Sentence compression is the task of shortening a sentence while preserving its important information and grammaticality. One advantage of compression by deleting words, as opposed to abstractive compression, lies in the small search space. Another is that the compressed sentence is more likely to be free of incorrect information not mentioned in the source sentence. A high-quality English sentence compression model that deletes words was trained on a large training dataset (Filippova and Altun, 2013; Filippova et al., 2015). In their dataset-creation method, nouns, verbs, adjectives, and adverbs (i.e., content words) shared by the source sentence S and its headline H are identified by matching word lemmas, and a rooted dependency subtree that contains all the shared content words is regarded as the compression C. Their method is designed for English and cannot be applied to Japanese as is. We train three models on the created Japanese dataset: the first is Filippova et al.’s original model, an encoder-decoder model with long short-term memory (LSTM), which we extend in this paper to obtain two further models that can control the output length (Kikuchi et al., 2016), because controlling the output length makes the compressed sentence more informative under a desired length.
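
The sketch below illustrates the dataset-creation idea described above: content words shared by the source sentence S and the headline H are found by lemma matching, and the smallest rooted dependency subtree covering them is taken as the compression C. It is a simplified illustration under assumed data structures (tokens carrying id, lemma, pos, and head fields), not the authors' implementation.

```python
CONTENT_POS = {"NOUN", "VERB", "ADJ", "ADV"}

def shared_content_ids(sentence, headline_lemmas):
    """Ids of content words in S whose lemma also appears in H."""
    return {t["id"] for t in sentence
            if t["pos"] in CONTENT_POS and t["lemma"] in headline_lemmas}

def minimal_rooted_subtree(sentence, target_ids):
    """Union of root-to-token paths: the smallest rooted subtree containing target_ids."""
    head = {t["id"]: t["head"] for t in sentence}   # head == 0 marks the root
    keep = set()
    for tid in target_ids:
        while tid != 0 and tid not in keep:
            keep.add(tid)
            tid = head[tid]
    return keep

# Toy English example (the same idea would apply to a Japanese dependency tree).
S = [
    {"id": 1, "lemma": "the",       "pos": "DET",  "head": 2},
    {"id": 2, "lemma": "company",   "pos": "NOUN", "head": 3},
    {"id": 3, "lemma": "announce",  "pos": "VERB", "head": 0},
    {"id": 4, "lemma": "a",         "pos": "DET",  "head": 6},
    {"id": 5, "lemma": "new",       "pos": "ADJ",  "head": 6},
    {"id": 6, "lemma": "product",   "pos": "NOUN", "head": 3},
    {"id": 7, "lemma": "yesterday", "pos": "ADV",  "head": 3},
]
H_lemmas = {"company", "announce", "product"}
C_ids = minimal_rooted_subtree(S, shared_content_ids(S, H_lemmas))
compressed = [t["lemma"] for t in S if t["id"] in C_ids]   # ['company', 'announce', 'product']
```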

Creating training dataset for Japanese
Identification of shared content words
Transformation of a dependency tree
Extraction of the minimum rooted subtree
Conditions imposed on news articles
Sentence compression with LSTM
Experiments
Automatic evaluation
Human evaluation
Findings
Conclusion

