Abstract
While paragraph embedding models are remarkably effective for downstream classification tasks, what they learn and encode into a single vector remains opaque. In this paper, we investigate a state-of-the-art paragraph embedding method proposed by Zhang et al. (2017) and discover that it cannot reliably tell whether a given sentence occurs in the input paragraph or not. We formulate a sentence content task to probe for this basic linguistic property and find that even a much simpler bag-of-words method has no trouble solving it. This result motivates us to replace the reconstruction-based objective of Zhang et al. (2017) with our sentence content probe objective in a semi-supervised setting. Despite its simplicity, our objective improves over paragraph reconstruction in terms of (1) downstream classification accuracies on benchmark datasets, (2) training speed, and (3) generalization ability.
Highlights
Methods that embed a paragraph into a single vector have been successfully integrated into many NLP applications, including text classification (Zhang et al., 2017), document retrieval (Le and Mikolov, 2014), and semantic similarity and relatedness (Dai et al., 2015; Chen, 2017).
We discover that the method falls short of capturing this basic property, and that replacing its reconstruction objective with a simple sentence content objective improves downstream classification.
Motivated by our probe results, we further investigate whether incorporating the sentence content property into a paragraph encoder can help increase downstream classification accuracies.
Summary
Methods that embed a paragraph into a single vector have been successfully integrated into many NLP applications, including text classification (Zhang et al., 2017), document retrieval (Le and Mikolov, 2014), and semantic similarity and relatedness (Dai et al., 2015; Chen, 2017). We investigate the paragraph embedding method of Zhang et al. (2017), which pairs a CNN-based encoder-decoder model (Sutskever et al., 2014) with a reconstruction objective to learn powerful paragraph embeddings capable of accurately reconstructing long paragraphs. This model significantly improves downstream classification accuracies, outperforming LSTM-based alternatives (Li et al., 2015).

Our experiments (Section 2) reveal that despite its impressive downstream performance, the model of Zhang et al. (2017) substantially underperforms a simple bag-of-words model on our sentence content probe. Given this result, it is natural to wonder whether the sentence content property is useful for downstream classification. Our results indicate that incorporating probe objectives into downstream models might help improve both accuracy and efficiency, which we hope will spur more linguistically-informed research into paragraph embedding methods.
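The sentence content probe is a binary classification task: given a paragraph representation and a sentence representation, predict whether the sentence occurs in the paragraph. The sketch below is illustrative only, not the authors' implementation: it assumes averaged bag-of-words embeddings, features built from the two embeddings plus their elementwise product (so lexical overlap is visible to a linear model), and a perceptron as the probe classifier; the toy vocabulary and example paragraphs are invented for the demonstration.

```python
# Illustrative sketch of a "sentence content" probe (not the authors'
# implementation): embed texts with averaged bag-of-words vectors, then
# train a binary classifier to predict whether a candidate sentence
# occurs in a given paragraph.
import numpy as np

# Tiny hypothetical vocabulary covering the toy examples below.
VOCAB = {w: i for i, w in enumerate("the a cat dog sat ran on fast mat".split())}

def bow_embed(text):
    """Average of one-hot word vectors: a simple bag-of-words embedding."""
    vec = np.zeros(len(VOCAB))
    words = [w for w in text.lower().split() if w in VOCAB]
    for w in words:
        vec[VOCAB[w]] += 1.0
    return vec / max(len(words), 1)

def probe_features(paragraph, sentence):
    """Paragraph and sentence embeddings plus their elementwise product,
    which exposes word overlap between the two texts to a linear probe."""
    p, s = bow_embed(paragraph), bow_embed(sentence)
    return np.concatenate([p, s, p * s])

# Toy data: positives pair a paragraph with a sentence drawn from it,
# negatives pair it with a sentence from a different paragraph.
paragraphs = ["the cat sat on the mat", "a dog ran fast"]
pos = [(paragraphs[0], "the cat sat"), (paragraphs[1], "a dog ran")]
neg = [(paragraphs[0], "a dog ran"), (paragraphs[1], "the cat sat")]
X = np.array([probe_features(p, s) for p, s in pos + neg])
y = np.array([1] * len(pos) + [0] * len(neg))

# Perceptron probe: converges on linearly separable data such as this toy set.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(5000):
    errors = 0
    for xi, yi in zip(X, y):
        pred = int(xi @ w + b > 0)
        if pred != yi:
            step = 1.0 if yi == 1 else -1.0
            w, b = w + step * xi, b + step
            errors += 1
    if errors == 0:  # every training pair classified correctly
        break

preds = (X @ w + b > 0).astype(int)
print("probe accuracy:", (preds == y).mean())
```

Because the two toy paragraphs share no vocabulary, the elementwise-product features are nonzero exactly for the positive pairs, so even this linear probe separates them; a real probe would of course use learned paragraph embeddings and a much larger training set.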