Abstract
Due to advances in technology, social media has become the most popular means of propagating news. Many news items are published on social media platforms such as Facebook, Twitter, and Instagram, but they are not categorized into domains such as politics, education, finance, art, sports, and health. Text classification is therefore needed to assign news to domains, which makes the huge amount of news available on social media manageable, reduces the time and effort needed to recognize an item's category or domain, and presents the data in a form that improves searching. Most existing datasets do not follow pre-processing and filtering steps and are not organized according to classification standards, so they are not ready for use. The Arabic Natural Language Processing (ANLP) phases are therefore used to pre-process, normalize, and categorize the news into the right domain. This paper proposes an Arabic Social Media Dataset (SMAD) for text classification over social media using ANLP steps. The SMAD dataset consists of 15,240 categorized Arabic news items from the Facebook social network. Experimental results show that the SMAD corpus achieves an accuracy of about 98% across five domains (Art, Education, Health, Politics, and Sport). The SMAD dataset has been trained and tested and is ready for use.
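The abstract names pre-processing and normalization as ANLP phases but does not list the individual rules. As a minimal sketch, assuming a few normalization rules that are common in Arabic text processing (diacritic removal, tatweel removal, unifying alef variants, mapping alef maqsura to ya, and dropping non-Arabic characters), the normalization phase might look like the following. The function name `normalize_arabic` and the rule set are illustrative assumptions, not the procedure used to build SMAD.

```python
import re

# Assumed rule set: common Arabic normalization steps, not taken from the paper.
DIACRITICS = re.compile(r"[\u064B-\u0652]")            # fathatan through sukun
TATWEEL = "\u0640"                                     # elongation character
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")    # alef with madda / hamza
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")         # digits, Latin, punctuation


def normalize_arabic(text: str) -> str:
    """Return a normalized copy of an Arabic news item (hypothetical helper)."""
    # Remove diacritics first so they are not turned into word-splitting spaces.
    text = DIACRITICS.sub("", text)
    text = text.replace(TATWEEL, "")
    # Map all hamza/madda alef forms to bare alef.
    text = ALEF_VARIANTS.sub("\u0627", text)
    # Map alef maqsura to ya.
    text = text.replace("\u0649", "\u064A")
    # Replace anything outside the basic Arabic letter range with a space.
    text = NON_ARABIC.sub(" ", text)
    # Collapse runs of whitespace.
    return " ".join(text.split())
```

Normalizing before classification reduces spelling variation, so that differently written forms of the same word map to one token. Which rules are appropriate depends on the corpus, and the choice here is only one plausible configuration.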
Highlights
News media has shifted from traditional formats such as newspapers, radio, and magazines to digital forms integrated with the internet, organized around social media platforms such as Facebook, Twitter, blogs, channels, and other digital media formats
This paper presents the Social Media Dataset (SMAD), a new Arabic social media dataset built from news sources on Facebook using a hybrid approach based on standard Arabic Natural Language Processing (ANLP) classification, covering five domains (Sports, Arts, Health, Education, and Politics)
Section B presents the main accuracy and quality results (recall, precision, and F-measure) for the SMAD dataset and compares the performance improvement of the personalized model with other similarity metrics for baseline datasets in different domains
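For reference, the quality metrics named above have standard definitions: for each domain, precision is TP / (TP + FP), recall is TP / (TP + FN), and the F-measure is the harmonic mean 2PR / (P + R). The sketch below computes them per domain from lists of true and predicted labels. The function names are illustrative, and this is not the evaluation code used for SMAD.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of items whose predicted domain equals the true domain."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_metrics(y_true, y_pred):
    """Return {label: (precision, recall, f_measure)} for every label seen."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correctly assigned to its domain
        else:
            fp[p] += 1          # wrongly assigned to domain p
            fn[t] += 1          # missed for its true domain t
    metrics = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        precision = tp[label] / predicted if predicted else 0.0
        recall = tp[label] / actual if actual else 0.0
        denom = precision + recall
        f_measure = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = (precision, recall, f_measure)
    return metrics
```

Reporting the metrics per domain, rather than accuracy alone, shows whether a high overall score hides weak performance on a smaller domain.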
Summary
News media has shifted from traditional formats such as newspapers, radio, and magazines to digital forms integrated with the internet, organized around social media platforms such as Facebook, Twitter, blogs, channels, and other digital media formats. Users of social media share news, communicate with other people, and create more posts and tweets related to the news than they consume. A huge amount of non-credible news is created and propagated through social media, which has a serious impact on society and individuals. Social media platforms need to categorize their news into different domains, such as politics, education, finance, art, sports, and health. Text classification is used to make the huge amount of news available on social media manageable. It reduces the time and effort needed to recognize an item's category or domain, and the data is pre-processed to improve searching and classification performance.
More From: International Journal of Advanced Computer Science and Applications