Abstract

Incorporating a duplicate question detection system can be beneficial for various systems such as community forums or question answering systems. Detecting question that have already an answer improves the user experience by reducing the search time and returning the correct answer. In this paper, we construct several methods for Arabic duplicate question detection based on machine learning. First, the pre-processing step is applied to clean and normalize questions. Next, we use Term Frequency Inverse Document Frequency (TF-IDF), Word2Vec, and FastText methods to map questions from their textual format into a vector space. Then, we trained various shallow learning methods (SVM, XGBoost, Random Forest, Logistic Regression) and deep learning methods (CNN, RNN, LSTM, GRU) with the objective of detecting if a pair of questions is duplicate or not. Various experiments were conducted to evaluate the performances of our models. The results obtained show that the deep learning model based on GRU with FastText representation performed better compared to the other models.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call