Abstract

The probability of question redundancy has increased significantly with the growing influx of users on community question answering (cQA) forums such as Quora and Stack Overflow. Because of this redundancy, answers are scattered across multiple variants of the same question, which leads to unsatisfactory search results for a specific question. To address this issue, this work proposes a model for discovering semantic similarity among cQA questions. We followed two approaches: (i) Feature-based: question embeddings are created using four forms of word embeddings and an ensemble of all four, and a Siamese LSTM (sLSTM) is then used to find the semantic similarity among the questions. (ii) Fine-tuning: we fine-tuned a BERT model on STS and SNLI data, employing a Siamese network architecture to generate semantically meaningful sentence embeddings; the resulting sBERT model is then used to assess the similarity between questions. Experiments were carried out on the Quora Question Pairs (QQP) and Stack Exchange cQA datasets with training sets of different sizes and word vectors of different dimensionalities. The model shows significant improvement over the state of the art on sentence similarity tasks.
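To make the fine-tuning approach concrete, the following is a minimal sketch of how question-pair similarity can be scored with a Siamese-BERT (SBERT) encoder using the sentence-transformers library. The checkpoint name and the example questions are illustrative assumptions, not the paper's STS/SNLI fine-tuned model or data.

    # Sketch: scoring a question pair with an SBERT-style encoder.
    # "all-MiniLM-L6-v2" is a public pretrained checkpoint used as a
    # stand-in; the paper fine-tunes its own model on STS and SNLI.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    q1 = "How do I improve my English speaking skills?"
    q2 = "What is the best way to get better at spoken English?"

    # Encode both questions into fixed-size sentence embeddings.
    emb1, emb2 = model.encode([q1, q2], convert_to_tensor=True)

    # Cosine similarity of the two embeddings; scores near 1 suggest
    # the questions are duplicates.
    score = util.cos_sim(emb1, emb2).item()
    print(f"similarity = {score:.3f}")

In practice a threshold on this cosine score (or a classifier over the pair of embeddings) would decide whether two cQA questions are treated as duplicates.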
