Abstract

In early January 2020, after China reported the first cases of the new coronavirus (SARS-CoV-2) in the city of Wuhan, unreliable and not fully accurate information started spreading faster than the virus itself. Alongside the pandemic, people have experienced a parallel infodemic, i.e., an overabundance of information, some of which is misleading or even harmful, that has spread widely around the globe. Although social media are increasingly being used as an information source, web search engines, such as Google or Yahoo!, still represent a powerful and trustworthy resource for finding information on the Web. This is due to their capability to capture the largest amount of information, helping users quickly identify the most relevant and useful, although not always the most reliable, results for their search queries. This study aims to detect potentially misleading and fake content by capturing and analysing the textual information that flows through search engines. Using a real-world dataset associated with the recent COVID-19 pandemic, we first apply re-sampling techniques to address class imbalance, and then use existing machine learning algorithms to classify unreliable news. By extracting lexical and host-based features of the uniform resource locators (URLs) associated with news articles, we show that the proposed methods, which are common in phishing and malicious URL detection, can improve the efficiency and performance of classifiers. Based on these findings, we suggest that the use of both textual and URL features can improve the effectiveness of fake news detection methods.

Highlights

  • The reliability and credibility of both the information source and information itself have emerged as a global issue in contemporary society [1, 2]

  • Based on our analysis of uniform resource locators (URLs), we observed a positive influence on the F1 score and recall metrics (Figure 8) for some machine learning (ML) classifiers after including the most relevant features extracted from URLs

  • We analysed metadata information extracted from web search engines, after submitting specific search queries related to the COVID-19 outbreak, simulating a normal user’s activity
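
As a rough illustration of what lexical URL features can look like, the following minimal Python sketch derives a few simple values from a URL string using only the standard library. The function name, the feature names, and the example URL are assumptions made for illustration; they are not the feature set used in the study. Host-based features, which typically require external lookups about the hosting domain, are not covered here.

```python
from urllib.parse import urlparse


def lexical_url_features(url: str) -> dict:
    """Compute a few illustrative lexical features from a URL string.

    The feature names are hypothetical and chosen for this sketch only.
    """
    parsed = urlparse(url)
    host = parsed.hostname or ""
    # Non-empty path segments, e.g. "/a/b/" -> ["a", "b"]
    segments = [seg for seg in parsed.path.split("/") if seg]
    return {
        "url_length": len(url),
        "host_length": len(host),
        "path_depth": len(segments),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens_in_host": host.count("-"),
        "num_dots_in_host": host.count("."),
        "uses_https": parsed.scheme == "https",
        "has_query": bool(parsed.query),
    }


if __name__ == "__main__":
    # Hypothetical URL, used only to show the output format.
    example = "https://news.example.com/covid-19/update?id=42"
    for name, value in lexical_url_features(example).items():
        print(f"{name}: {value}")
```

Features of this kind could be appended to the textual features of each search result before training a classifier.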


Summary

Introduction

The reliability and credibility of both the information source and information itself have emerged as a global issue in contemporary society [1, 2]. In the last decades, social media have revolutionised the way in which information spreads across the Web and, more generally, the world [3, 4], by allowing users to freely share content faster than traditional news sources. The fact that content spreads so quickly and across platforms suggests that people (and the algorithms behind the platforms) are potentially vulnerable to misinformation, hoaxes, biases, and low-credibility content that is shared daily, accidentally or intentionally. The problem of spreading misinformation affects both social media platforms and the World Wide Web. Every time people enter a search query on web search engines (WSEs), such as Google or Bing, they can view and potentially access hundreds, or thousands, of web pages with helpful, but sometimes potentially misleading, information. Meta title tags displayed on search engine results pages (SERPs) [5] represent a crucial factor in helping the user understand pages’ content, being

