Abstract

Large Language Models (LLMs) have ushered in a major paradigm shift in Natural Language Processing (NLP), where large pre-trained language models (LMs) have become a fundamental component of most NLP tasks. These models can learn relevant and meaningful representations of a language without any supervision, and fine-tuning them for typical NLP tasks yields substantially higher precision than conventional shallow learning techniques. However, training these models requires a massive corpus that adequately represents the language. Owing to the availability of enormous corpora, English LLMs typically perform better than their counterparts in other languages. This effort focuses on the design and development of a large Arabic corpus. The corpus comprises over 500 GB of cleaned Arabic text, intended to improve the cross-domain knowledge and downstream generalization capability of LLMs. The corpus was employed in training a large Arabic LLM. To assess the efficacy of the LLM, a variety of typical NLP tasks were fine-tuned. The fine-tuned tasks exhibited a significant boost in accuracy, ranging between 4.5% and 8.5%, when compared to those fine-tuned from multilingual BERT (mBERT). To the best of our knowledge, this is currently the largest clean and diverse Arabic corpus ever assembled.
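
The abstract compares downstream tasks fine-tuned from the new Arabic LLM against those fine-tuned from mBERT. The following is a minimal sketch (not the authors' code) of such a comparison using the HuggingFace Transformers and Datasets libraries; "arabic-llm-checkpoint" is a hypothetical placeholder for the paper's model, and the "ajgt_twitter_ar" dataset is just one publicly available Arabic text-classification example.

```python
# Sketch: fine-tune a pre-trained checkpoint on an Arabic classification task
# and compare mBERT against an Arabic LM by swapping the checkpoint name.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)
from datasets import load_dataset

checkpoint = "bert-base-multilingual-cased"   # baseline: mBERT
# checkpoint = "arabic-llm-checkpoint"        # hypothetical: the paper's Arabic LLM

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

# Example Arabic sentiment dataset with "text" and "label" columns.
dataset = load_dataset("ajgt_twitter_ar")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=128)

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=encoded["train"],
)
trainer.train()  # repeat with the other checkpoint and compare held-out accuracy
```

In this setup, only the `checkpoint` string changes between runs, so any accuracy difference on a held-out split can be attributed to the pre-training corpus rather than the fine-tuning recipe.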
