Abstract
While paragraph embedding models are remarkably effective for downstream classification tasks, what they learn and encode into a single vector remains opaque. In this paper, we investigate a state-of-the-art paragraph embedding method proposed by Zhang et al. (2017) and discover that it cannot reliably tell whether a given sentence occurs in the input paragraph or not. We formulate a sentence content task to probe for this basic linguistic property and find that even a much simpler bag-of-words method has no trouble solving it. This result motivates us to replace the reconstruction-based objective of Zhang et al. (2017) with our sentence content probe objective in a semi-supervised setting. Despite its simplicity, our objective improves over paragraph reconstruction in terms of (1) downstream classification accuracies on benchmark datasets, (2) training speed, and (3) generalization ability.
Highlights
Methods that embed a paragraph into a single vector have been successfully integrated into many NLP applications, including text classification (Zhang et al., 2017), document retrieval (Le and Mikolov, 2014), and semantic similarity and relatedness (Dai et al., 2015; Chen, 2017).
We discover that the method falls short of capturing this basic property, and that replacing its reconstruction objective with a simple sentence content objective improves downstream classification.
Motivated by our probe results, we further investigate whether incorporating the sentence content property into a paragraph encoder can help increase downstream classification accuracies.
Summary
Methods that embed a paragraph into a single vector have been successfully integrated into many NLP applications, including text classification (Zhang et al., 2017), document retrieval (Le and Mikolov, 2014), and semantic similarity and relatedness (Dai et al., 2015; Chen, 2017). We investigate the paragraph embedding method of Zhang et al. (2017), which pairs a CNN-based encoder-decoder model (Sutskever et al., 2014) with a reconstruction objective to learn powerful paragraph embeddings capable of accurately reconstructing long paragraphs. This model significantly improves downstream classification accuracies, outperforming LSTM-based alternatives (Li et al., 2015).

Our experiments (Section 2) reveal that despite its impressive downstream performance, the model of Zhang et al. (2017) substantially underperforms a simple bag-of-words model on our sentence content probe. Given this result, it is natural to wonder whether the sentence content property is useful for downstream classification. Our results indicate that incorporating probe objectives into downstream models might help improve both accuracy and efficiency, which we hope will spur more linguistically-informed research into paragraph embedding methods.
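The sentence content probe is a binary classification task: given a paragraph representation and a sentence representation, predict whether the sentence occurs in the paragraph. The sketch below is illustrative only, not the authors' implementation: it assumes averaged bag-of-words embeddings, features built from the two embeddings plus their elementwise product (so lexical overlap is visible to a linear model), and a perceptron as the probe classifier; the toy vocabulary and example paragraphs are invented for the demonstration.

```python
# Illustrative sketch of a "sentence content" probe (not the authors'
# implementation): embed texts with averaged bag-of-words vectors, then
# train a binary classifier to predict whether a candidate sentence
# occurs in a given paragraph.
import numpy as np

# Tiny hypothetical vocabulary covering the toy examples below.
VOCAB = {w: i for i, w in enumerate("the a cat dog sat ran on fast mat".split())}

def bow_embed(text):
    """Average of one-hot word vectors: a simple bag-of-words embedding."""
    vec = np.zeros(len(VOCAB))
    words = [w for w in text.lower().split() if w in VOCAB]
    for w in words:
        vec[VOCAB[w]] += 1.0
    return vec / max(len(words), 1)

def probe_features(paragraph, sentence):
    """Paragraph and sentence embeddings plus their elementwise product,
    which exposes word overlap between the two texts to a linear probe."""
    p, s = bow_embed(paragraph), bow_embed(sentence)
    return np.concatenate([p, s, p * s])

# Toy data: positives pair a paragraph with a sentence drawn from it,
# negatives pair it with a sentence from a different paragraph.
paragraphs = ["the cat sat on the mat", "a dog ran fast"]
pos = [(paragraphs[0], "the cat sat"), (paragraphs[1], "a dog ran")]
neg = [(paragraphs[0], "a dog ran"), (paragraphs[1], "the cat sat")]
X = np.array([probe_features(p, s) for p, s in pos + neg])
y = np.array([1] * len(pos) + [0] * len(neg))

# Perceptron probe: converges on linearly separable data such as this toy set.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(5000):
    errors = 0
    for xi, yi in zip(X, y):
        pred = int(xi @ w + b > 0)
        if pred != yi:
            step = 1.0 if yi == 1 else -1.0
            w, b = w + step * xi, b + step
            errors += 1
    if errors == 0:  # every training pair classified correctly
        break

preds = (X @ w + b > 0).astype(int)
print("probe accuracy:", (preds == y).mean())
```

Because the two toy paragraphs share no vocabulary, the elementwise-product features are nonzero exactly for the positive pairs, so even this linear probe separates them; a real probe would of course use learned paragraph embeddings and a much larger training set.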