Abstract

Despite the availability of numerous online corpora, students and researchers in Natural Language Processing (NLP), corpus linguistics, and language learning and teaching may still need to develop their own corpora. Several commercial and free standalone corpus processing systems are available to process such corpora. In this study, we first propose a framework for evaluating standalone corpus processing systems and then use it to evaluate seven freely available systems. The proposed framework considers the usability, functionality, and performance of the evaluated systems while taking into account their suitability for Arabic corpora. The results show that most of the evaluated systems achieved comparable usability scores, whereas their functionality and performance scores differed substantially with respect to support for the Arabic language and N-gram profile generation. The results of our evaluation will help potential users choose the system that best meets their needs. More importantly, the framework gives developers of the evaluated systems a basis for enhancing them, and gives developers of new corpus processing systems a reference to build against.

Highlights

  • Because of the growing interest in using corpora for linguistics research, language learning and teaching, and Natural Language Processing (NLP) [1], a vast number of corpora are available in different languages [2].

  • To evaluate the performance of the seven corpus processing tools, we measured how long each system takes to display results for three functions (word frequency, 2-gram frequency, and concordance) on three Arabic corpora that vary in structure and size: the 2012 Corpus of Arabic Newspapers (2012 CAN) [13], the KACST Arabic Text Classification Corpus (KACST ATCC) (Requested from Authors) [16], and the King Saud University Corpus of Classical Arabic (KSUCCA) [17].

  • The results show that the scores obtained in the usability dimension ranged between 69% and 80%, with an average of 77% and a standard deviation of 0.0371 (about 3.7 percentage points).
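The performance measurement described in the highlights (time to produce word frequency, 2-gram frequency, and concordance results) can be illustrated with a minimal Python sketch. It is not the procedure or tooling used in the study: the whitespace tokenizer, the helper names (`tokenize`, `word_frequency`, `ngram_frequency`, `concordance`, `timed`), and the short sample sentence are all assumptions made for illustration, and real Arabic corpora would need proper normalization and tokenization.

```python
import time
from collections import Counter


def tokenize(text):
    # Assumption: plain whitespace tokenization, with no Arabic-specific
    # normalization (diacritics, clitics, etc.).
    return text.split()


def word_frequency(tokens):
    # Count how often each token occurs.
    return Counter(tokens)


def ngram_frequency(tokens, n=2):
    # Count contiguous n-token sequences (n=2 gives 2-grams).
    return Counter(zip(*(tokens[i:] for i in range(n))))


def concordance(tokens, keyword, width=3):
    # Keyword-in-context: (left context, keyword, right context) per hit.
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


def timed(func, *args, **kwargs):
    # Return the function result together with elapsed wall-clock seconds.
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


# Hypothetical sample text; a real run would load a corpus from disk.
sample = "في البدء كان الكتاب ثم كان الكتاب الثاني"
tokens = tokenize(sample)

freq, t_freq = timed(word_frequency, tokens)
bigrams, t_bi = timed(ngram_frequency, tokens, 2)
kwic, t_kwic = timed(concordance, tokens, "كان", width=2)

print(f"word frequency: {t_freq:.6f}s, 2-grams: {t_bi:.6f}s, "
      f"concordance: {t_kwic:.6f}s")
```

Timing a standalone GUI system also includes loading and rendering, so figures from a sketch like this are not comparable to those reported for the evaluated tools.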


Summary

Introduction

Because of the growing interest in using corpora for linguistics research, language learning and teaching, and Natural Language Processing (NLP) [1], a vast number of corpora are available in different languages [2]. Many of these corpora are freely available, either to explore with specially designed tools via the Internet or to download as plain text files. This paper attempts to bridge this gap by providing a framework for evaluating standalone corpus processing systems in three dimensions (usability, functionality, and performance) while taking into consideration their suitability for Arabic corpora. The availability of such a framework will help developers enhance and improve their systems, and help users choose the system that best fits their needs.
