Quran question and answer corpus for data mining with WEKA

Bothaina Hamoud, Eric Atwell

doi:10.1109/sgcac.2016.7458032

Abstract

This paper presents the compilation of a Holy Quran question and answer corpus, created for data mining with the Waikato Environment for Knowledge Analysis (WEKA). Questions and answers from the Quran were collected from multiple data sources, and a representative sample of the questions and answers was selected for use in our model. The data was then cleaned to bring its quality up to the level required by the WEKA tool and converted to comma-separated values (CSV) format, giving a corpus dataset that can be loaded into WEKA. Next, the StringToWordVector filter was used to convert each string into a bag, or vector, of word frequencies for further analysis with different data mining techniques. Finally, we applied a clustering algorithm to the processed attributes and show the resulting clusters in the WEKA cluster visualizer.
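
As a rough illustration of the CSV and bag-of-words steps described above, the following Python sketch parses a small CSV sample and turns each question string into a vector of word counts. It is not WEKA's StringToWordVector implementation and not the authors' code: the column names `question` and `answer`, the two sample rows (placeholder strings, not corpus data), and the helper names `tokenize`, `load_rows`, and `to_word_vectors` are all assumptions made for the example.

```python
import csv
import io
import re
from collections import Counter

# Hypothetical two-row sample; placeholder strings, not data from the corpus.
SAMPLE_CSV = '''question,answer
"What is the first word?","The first word is read"
"What is the second word?","The second word is write"
'''


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def load_rows(csv_text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(csv_text)))


def to_word_vectors(texts):
    """Return (vocabulary, vectors), with one word-count vector per text."""
    counts = [Counter(tokenize(t)) for t in texts]
    # The vocabulary is every distinct token, sorted for a stable column order.
    vocabulary = sorted(set().union(*counts))
    vectors = [[c[word] for word in vocabulary] for c in counts]
    return vocabulary, vectors


rows = load_rows(SAMPLE_CSV)
vocabulary, vectors = to_word_vectors([row["question"] for row in rows])
```

Each column of `vectors` corresponds to one entry of `vocabulary`, which is roughly the kind of numeric attribute table a clustering algorithm can then consume. The sketch uses raw counts and a simple regular-expression tokenizer; it does not attempt to mirror the filter's actual options.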

Full Text