A Proposed Model for Focused Crawling and Automatic Text Classification of Online Crime Web Pages

Muneer A S Hazaa, Mohammed Albared, Fadl M Ba-Alwi

doi:10.59167/tujnas.v6i6.1329

Abstract

With the exponential growth of textual information on the Internet, there is a growing need to find relevant, timely and in-depth knowledge about crime. The sheer size of this data makes it very difficult to retrieve, analyze and use the valuable information in such texts manually. In this paper, we address a challenging task: the crawling and classification of crime-specific knowledge on the Web. To that end, we introduce a model for online crime text crawling and classification. First, a crime-specific web crawler is designed to collect crime-related web pages from news websites. In this crawler, a binary Naive Bayes classifier filters crime web pages from all others. Second, a multi-class classification model categorizes the crime pages into their appropriate crime types. In both steps, several feature selection methods are applied to select the most important features. Finally, the model is evaluated on a manually labeled corpus and on real-world online data. The experimental results on the manually labeled corpus indicate that Naive Bayes with mutual information and odds ratio feature selection can accurately distinguish crime web pages from others, with an F1 measure of 0.99. The results also show that the Naive Bayes classification models can accurately assign crime documents to their appropriate crime types, with a Macro-F1 measure of 0.87. Our results on real-world online data further show that the focused crawler with two-level classification is very effective for gathering high-quality collections of crime Web documents and for classifying them.
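
The pipeline described in the abstract (a binary Naive Bayes filter inside the crawler, followed by a multi-class Naive Bayes model over the retained pages, each with feature selection) can be illustrated with a minimal sketch. This is not the authors' implementation. The training sentences, the in-memory TOY_WEB site, the function and class names, the top_k value, and the policy of following links only from pages judged to be crime pages are all assumptions made for illustration. Only mutual information is shown; odds ratio, which the abstract also mentions, is omitted.

```python
import math
from collections import Counter, deque

def tokenize(text):
    return text.lower().split()

def select_by_mutual_information(docs, labels, top_k):
    """Keep the top_k terms ranked by mutual information with the class label."""
    n = len(docs)
    doc_terms = [set(tokenize(d)) for d in docs]
    class_sizes = Counter(labels)
    scores = {}
    for term in set().union(*doc_terms):
        n_t = sum(term in terms for terms in doc_terms)
        score = 0.0
        for c, n_c in class_sizes.items():
            n_tc = sum(term in terms and y == c for terms, y in zip(doc_terms, labels))
            # Two cells per class: term present in class c, term absent in class c.
            for joint, marginal in ((n_tc, n_t), (n_c - n_tc, n - n_t)):
                if joint:
                    score += joint / n * math.log(joint * n / (marginal * n_c))
        scores[term] = score
    return set(sorted(scores, key=lambda t: (-scores[t], t))[:top_k])

class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over the selected terms."""
    def fit(self, docs, labels, top_k):
        self.vocab = select_by_mutual_information(docs, labels, top_k)
        self.priors = {c: math.log(k / len(labels)) for c, k in Counter(labels).items()}
        self.counts = {c: Counter() for c in self.priors}
        for doc, label in zip(docs, labels):
            self.counts[label].update(t for t in tokenize(doc) if t in self.vocab)
        return self

    def predict(self, doc):
        tokens = [t for t in tokenize(doc) if t in self.vocab]
        def log_posterior(c):
            total = sum(self.counts[c].values()) + len(self.vocab)
            return self.priors[c] + sum(
                math.log((self.counts[c][t] + 1) / total) for t in tokens)
        return max(sorted(self.priors), key=log_posterior)

def train_models(rows, top_k=6):
    """Level 1: crime vs. other on all rows. Level 2: crime type on crime rows only."""
    crime_filter = NaiveBayes().fit([t for t, _, _ in rows], [a for _, a, _ in rows], top_k)
    crime_rows = [(t, kind) for t, level1, kind in rows if level1 == "crime"]
    type_model = NaiveBayes().fit([t for t, _ in crime_rows], [k for _, k in crime_rows], top_k)
    return crime_filter, type_model

def focused_crawl(seed, fetch, crime_filter, type_model):
    """Breadth-first crawl that keeps crime pages and expands links only from them."""
    queue, seen, collected = deque([seed]), {seed}, {}
    while queue:
        url = queue.popleft()
        text, links = fetch(url)
        if crime_filter.predict(text) != "crime":
            continue  # assumed policy: do not follow links from off-topic pages
        collected[url] = type_model.predict(text)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected

TRAIN = [  # toy data (assumption): (text, level-1 label, level-2 label)
    ("police arrest robbery suspect", "crime", "robbery"),
    ("police probe bank robbery", "crime", "robbery"),
    ("police charge man with murder", "crime", "murder"),
    ("murder victim found police say", "crime", "murder"),
    ("team wins football match", "other", None),
    ("football final draws crowd", "other", None),
    ("market rises on bank profit", "other", None),
    ("market falls after profit warning", "other", None),
]

TOY_WEB = {  # hypothetical in-memory site: url -> (page text, outgoing links)
    "news/1": ("police arrest suspect after armed robbery", ["news/2", "sport/1"]),
    "news/2": ("murder victim found in park", []),
    "sport/1": ("football team wins cup final", ["news/3"]),
    "news/3": ("police probe robbery", []),
}

crime_filter, type_model = train_models(TRAIN)
pages = focused_crawl("news/1", TOY_WEB.__getitem__, crime_filter, type_model)
print(pages)
```

On this toy data the crawl keeps the two crime pages reachable from the seed and labels them by type, while the sports page is dropped and the page linked only from it is never fetched. A real crawler would replace the fetch argument with an HTTP client and HTML link extraction, and would need far more training data than the eight sentences used here.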

More From: Thamar University Journal of Natural & Applied Sciences
