Duplicate Question Detection With Deep Learning in Stack Overflow

Liting Wang, Jing Jiang, Li Zhang

doi:10.1109/access.2020.2968391

Abstract

Stack Overflow is a popular Community-based Question Answering (CQA) website focused on software programming, and it has attracted a growing number of users in recent years. However, duplicate questions appear frequently on Stack Overflow, and they are marked manually by high-reputation users. Automatic duplicate question detection reduces the labor and effort required of these users. Although existing approaches extract textual features to detect duplicate questions automatically, they are limited because semantic information can be lost. To tackle this problem, we explore the use of deep learning techniques, including Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) and Long Short-Term Memory (LSTM), to detect duplicate questions in Stack Overflow. In addition, we use Word2Vec to obtain vector representations of words. The deep learning models can capture semantic information at the document level, and Word2Vec can capture it at the word level. We therefore construct three deep learning approaches, WV-CNN, WV-RNN and WV-LSTM, which are based on Word2Vec, CNN, RNN and LSTM, to detect duplicate questions in Stack Overflow. Evaluation results show that WV-CNN and WV-LSTM significantly improve on four baseline approaches (i.e., DupPredictor, Dupe, DupPredictorRep-T, and DupeRep) and three deep learning approaches (i.e., DQ-CNN, DQ-RNN, and DQ-LSTM) in terms of recall-rate@5, recall-rate@10 and recall-rate@20. Furthermore, the experimental results indicate that our approaches WV-CNN, WV-RNN, and WV-LSTM outperform four machine learning approaches based on Support Vector Machine, Logistic Regression, Random Forest and eXtreme Gradient Boosting in terms of recall-rate@5, recall-rate@10 and recall-rate@20.

Highlights

Community-based Question Answering (CQA) websites such as Quora, Yahoo! Answers, and Stack Overflow are becoming increasingly popular. Stack Overflow is a CQA website about software programming.
We focus on three research questions (RQ). RQ1: Which of our three approaches (i.e., WV-CNN, WV-RNN, and WV-LSTM) is superior to the four baseline approaches (i.e., DupPredictor, Dupe, DupPredictorRep-T, and DupeRep) for solving the problem of duplicate question detection?
RQ2: The experimental results illustrate that WV-CNN, WV-RNN and WV-LSTM, which are based on Word2Vec, CNN, RNN and LSTM, outperform the four machine learning approaches SVM, LR, RF and Xgboost for six different question groups in terms of recall-rate@5, recall-rate@10 and recall-rate@20.

Summary

INTRODUCTION

Community-based Question Answering (CQA) websites such as Quora, Yahoo! Answers, and Stack Overflow are becoming increasingly popular. Stack Overflow is a CQA website about software programming. In our previous work [12], we explored three deep learning approaches (i.e., DQ-CNN, DQ-RNN, and DQ-LSTM), based on CNN, RNN and LSTM, to solve the problem of duplicate question detection in Stack Overflow. That work [12] also used four machine learning approaches (i.e., Support Vector Machine (SVM) [18], Logistic Regression (LR) [19], Random Forest (RF) [20], and eXtreme Gradient Boosting (Xgboost) [21]) to detect duplicate questions. We compare these four machine learning approaches (i.e., SVM, LR, RF, and Xgboost) with our three approaches (i.e., WV-CNN, WV-RNN, and WV-LSTM) to analyze the effectiveness of traditional machine learning approaches and deep learning approaches in duplicate question detection tasks.

BACKGROUND

PROBLEM FORMULATION AND WORD EMBEDDING

EVALUATION

EVALUATION METRIC
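
The abstract reports results as recall-rate@5, recall-rate@10 and recall-rate@20. As a minimal sketch, assuming the usual definition of recall-rate@k (the fraction of duplicate questions whose master question appears among the top-k ranked candidates), the metric could be computed as follows. The function name, the dictionary layout, and the toy question IDs are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def recall_rate_at_k(
    rankings: Dict[str, List[str]],
    masters: Dict[str, str],
    k: int,
) -> float:
    """Fraction of duplicate questions whose master is in the top-k candidates.

    rankings maps each duplicate question ID to its candidate IDs, best first.
    masters maps each duplicate question ID to its true master question ID.
    Both layouts are assumptions made for this illustration.
    """
    if not rankings:
        return 0.0
    hits = sum(
        1 for question, ranked in rankings.items()
        if masters[question] in ranked[:k]
    )
    return hits / len(rankings)


# Hypothetical toy data: three duplicate questions with ranked candidates.
rankings = {
    "q1": ["m9", "m1", "m4"],  # master m1 is ranked second
    "q2": ["m2", "m7", "m8"],  # master m2 is ranked first
    "q3": ["m5", "m6", "m0"],  # master m3 is never retrieved
}
masters = {"q1": "m1", "q2": "m2", "q3": "m3"}

for k in (1, 2, 3):
    print(f"recall-rate@{k} = {recall_rate_at_k(rankings, masters, k):.3f}")
```

A larger k can only keep or raise the score, which is why recall-rate@20 is never lower than recall-rate@5 for the same ranking.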

THREATS TO VALIDITY

RELATED WORK

CONCLUSION

Journal: IEEE Access	Publication Date: Jan 1, 2020
Citations: 39	License type: CC BY 4.0

Similar Papers

Detecting Duplicate Questions in Stack Overflow via Deep Learning Approaches
Liting Wang ... Li Zhang
01 Dec 2019

Detecting Duplicate Questions in Stack Overflow via Semantic and Relevance Approaches
Zhifang Liao ... Song Yu
01 Dec 2021

Mining duplicate questions in stack overflow
Muhammad Ahasanuzzaman ... Chanchal K Roy
14 May 2016

How Developers and Tools Categorize Sentiment in Stack Overflow Questions - A Pilot Study
Niloofar Mansoor ... Cole S Peterson
01 May 2021
