AbstractElectronic newspapers are one of the most common sources of Modern Standard Arabic. Existing datasets of Arabic news articles typically provide a title, body, and single label. Ignoring important features, like the article author, image, tags, and publication date, can degrade the efficacy of classification models. In this paper, we propose the Arabic multi-purpose integral news articles (AMINA) dataset. AMINA is a large-scale Arabic news corpus with over 1,850,000 articles collected from 9 Arabic newspapers from different countries. It includes all the article features: title, tags, publication date and time, location, author, article image and its caption, and the number of visits. To test the efficacy of the proposed dataset, three tasks were developed and validated: article textual content (classification and generation) and article image classification. For content classification, we experimented the performance of several state-of-the-art Arabic NLP models including AraBERT and CAMeL-BERT, etc. For content generation, the reformer architecture is adopted as a character text generation model. For image classification applied on Al-Sharq and Youm7 news portals, we have compared the performance of 10 pre-trained models including ConvNeXt, MaxViT, ResNet18, etc. The overall study verifies the significance and contribution of our newly introduced Arabic articles dataset. The AMINA dataset has been released at https://huggingface.co/datasets/MohamedZayton/AMINA.
Read full abstract