Abstract
Large Language Models (LLMs) have ushered in a major paradigm shift in Natural Language Processing (NLP), where large pre-trained language models (LMs) have become a fundamental component of most NLP tasks. These models can learn relevant and meaningful representations of a language without any supervision, and fine-tuning them for typical NLP tasks yields substantially higher precision than conventional shallow learning techniques. However, training these models requires a massive corpus that adequately represents the language. Owing to the availability of enormous corpora, English LLMs typically perform better than their counterparts in other languages. This effort focuses on the design and development of a large Arabic corpus. The corpus comprises over 500 GB of cleaned Arabic text and is intended to improve the cross-domain knowledge and downstream generalization capability of LLMs. The corpus was employed in the training of a large Arabic LLM. To assess the efficacy of the LLM, a variety of typical NLP tasks were fine-tuned. The fine-tuned tasks exhibited a significant boost in accuracy, ranging between 4.5% and 8.5%, when compared to those downstreamed from multilingual BERT (mBERT). To the best of our knowledge, this is currently the largest clean and diverse Arabic corpus ever assembled.
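The abstract does not specify the tooling used for the downstream evaluation, but a minimal sketch of the comparison it describes could look like the following, assuming the HuggingFace Transformers and Datasets libraries: the same fine-tuning recipe is applied to the mBERT baseline and to the corpus-trained Arabic model, and accuracies are compared. The dataset name, hyperparameters, and the Arabic checkpoint path are illustrative placeholders, not details from the paper.

```python
# Hypothetical fine-tuning comparison sketch; model/dataset names and
# hyperparameters are placeholders, not the paper's actual setup.
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)
from datasets import load_dataset


def fine_tune(checkpoint: str, dataset_name: str = "ajgt_twitter_ar"):
    """Fine-tune a pretrained LM on a small Arabic sentiment task (binary labels)."""
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    dataset = load_dataset(dataset_name)  # Arabic tweets with a 'text' and 'label' column

    def tokenize(batch):
        return tokenizer(batch["text"], truncation=True,
                         padding="max_length", max_length=128)

    dataset = dataset.map(tokenize, batched=True)

    args = TrainingArguments(
        output_dir=f"ft-{checkpoint.split('/')[-1]}",
        num_train_epochs=3,
        per_device_train_batch_size=16,
        learning_rate=2e-5,
    )
    trainer = Trainer(model=model, args=args, train_dataset=dataset["train"])
    trainer.train()
    return trainer


# Baseline: multilingual BERT; the Arabic-corpus LM checkpoint path is hypothetical.
fine_tune("bert-base-multilingual-cased")
# fine_tune("path/to/arabic-llm-checkpoint")
```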