Fine-Grained Quran Dataset

Mohamed Osman, Mohammad Alhawarat, Anwer Hilal

doi:10.14569/ijacsa.2015.061241

Open Access

Abstract

Extracting knowledge from text documents has become one of the central topics in Natural Language Processing (NLP) in the era of information explosion. Arabic NLP is considered immature for several reasons, including the scarcity of available resources. At the same time, automatically extracting reliable knowledge from specialized sources such as holy books is a highly challenging task, but one of great potential benefit. In this context, this paper provides a comprehensive Quranic dataset as the first part (the foundation) of ongoing research that aims to lay the groundwork for approaches and applications that explore the holy Quran. The paper presents the algorithms and approaches designed to extract aggregated data from massive Arabic text sources, including the holy Quran and closely associated books. The text of the holy Quran is transformed into structured multi-dimensional data records, starting from the chapter level, then the word level, and then the character level. These records are linked with interpretations and meanings, parsing, translations, intonation, and the roots and stems of words, all drawn from authentic and reliable sources. The final dataset is delivered as Excel sheets and database records. The paper also presents models of the dataset at all levels. The Quranic dataset was designed to suit database, data mining, text mining and Artificial Intelligence applications; it is also designed to serve as a comprehensive encyclopedia of the holy Quran and the Quranic Science books.

Highlights

In recent years, a large number of language datasets and corpora have been developed, and their number has grown with the spread of cloud computing applications and data linking.
This study aims to build a group of datasets for the holy Quran, its interpretations, its meanings and related scientific books.
As shown in Figures 9-16, the results match the real data. For example: 1) the holy Quran is composed of 114 chapters and 6,236 verses with 77,477 words [1]; the same statistics computed from our dataset give the same result.
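
The consistency check described in the last highlight can be illustrated with a minimal Python sketch. It assumes a word-level table with one row per word, keyed by chapter and verse number; the `WordRow` type, the `dataset_totals` and `mismatches` helpers, and the toy rows are hypothetical and are not taken from the paper. Only the reference totals (114 chapters, 6,236 verses, 77,477 words) come from the text above.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class WordRow(NamedTuple):
    """One row of a hypothetical word-level table (layout is an assumption)."""
    chapter_no: int
    verse_no: int
    word: str


# Reference totals quoted in the text (attributed there to source [1]).
REFERENCE_TOTALS: Dict[str, int] = {"chapters": 114, "verses": 6236, "words": 77477}


def dataset_totals(rows: Iterable[WordRow]) -> Dict[str, int]:
    """Count distinct chapters, distinct (chapter, verse) pairs, and word rows."""
    rows = list(rows)
    chapters = {r.chapter_no for r in rows}
    verses = {(r.chapter_no, r.verse_no) for r in rows}
    return {"chapters": len(chapters), "verses": len(verses), "words": len(rows)}


def mismatches(totals: Dict[str, int],
               reference: Dict[str, int] = REFERENCE_TOTALS) -> Dict[str, Tuple[int, int]]:
    """Return {statistic: (computed, expected)} for every statistic that differs."""
    return {k: (totals[k], reference[k]) for k in reference if totals[k] != reference[k]}


# Placeholder rows only, NOT real data: two chapters, three verses, four words.
toy_rows = [
    WordRow(1, 1, "w1"),
    WordRow(1, 1, "w2"),
    WordRow(1, 2, "w3"),
    WordRow(2, 1, "w4"),
]

totals = dataset_totals(toy_rows)
print(totals)              # totals for the toy rows
print(mismatches(totals))  # non-empty here, because the toy rows are not the full text
```

Run against the complete word-level table, an empty result from `mismatches` would correspond to the agreement the highlight reports.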

Summary

Introduction

A large number of language datasets and corpora have been developed, and their number has grown with the spread of cloud computing applications and data linking. This study aims to build a group of datasets for the holy Quran, its interpretations, its meanings and related scientific books. The holy Quran is composed of 114 chapters and about 6,236 verses with 77,477 words [1]. This group of books addresses every verse and word through interpretation, parsing and clarification, supporting these with the reasons for the revelations and the sayings (Hadith) of Prophet Mohammad, Peace Be Upon Him (PBUH). The result is a massive amount of text that is hard to process in its unstructured form. This study focuses on developing a model (algorithms and methodologies) for building a homogeneous dataset that accommodates all of this content, producing a comprehensive set of structured data for the Holy Quran and its scientific books. The dataset is intended to be used as an encyclopedia of the Holy Quran and to serve as infrastructure for technical applications that seek to produce results and carry out research on this vast amount of data.
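
To make the idea of structured records at several levels more concrete, the following Python sketch breaks a single verse into word-level and character-level records, each carrying the keys of its parent. It is only an illustration under stated assumptions: the `Verse`, `WordRecord` and `CharRecord` types, their field names, and whitespace tokenization are hypothetical and do not reproduce the paper's actual schema or algorithms.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Verse:
    chapter_no: int  # chapter number
    verse_no: int    # verse number within the chapter
    text: str        # verse text; words assumed to be whitespace-separated


@dataclass(frozen=True)
class WordRecord:
    chapter_no: int
    verse_no: int
    word_no: int     # 1-based position of the word in the verse
    word: str


@dataclass(frozen=True)
class CharRecord:
    chapter_no: int
    verse_no: int
    word_no: int
    char_no: int     # 1-based position of the code point in the word
    char: str


def word_records(verse: Verse) -> List[WordRecord]:
    """Split a verse on whitespace into word-level records (an assumed tokenization)."""
    return [WordRecord(verse.chapter_no, verse.verse_no, i, w)
            for i, w in enumerate(verse.text.split(), start=1)]


def char_records(word: WordRecord) -> List[CharRecord]:
    """Split a word into one record per Unicode code point."""
    return [CharRecord(word.chapter_no, word.verse_no, word.word_no, j, c)
            for j, c in enumerate(word.word, start=1)]


# First verse of the first chapter, written here without diacritics.
verse = Verse(chapter_no=1, verse_no=1, text="بسم الله الرحمن الرحيم")
words = word_records(verse)
chars = [c for w in words for c in char_records(w)]

print(len(words), "word records;", len(chars), "character records")
```

Because every record repeats the keys of its parent level, interpretations, translations or roots could be joined to a verse or a word by those keys. Note that iterating over a Python string yields code points, so fully diacritized text would produce separate records for the diacritic marks; how the paper itself counts characters is not stated in this section.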

Objectives

Methods

Conclusion