AN EFFECTIVE FUZZY CLUSTERING ALGORITHM FOR WEB DOCUMENT CLASSIFICATION: A CASE STUDY IN CULTURAL CONTENT MINING

George E Tsekouras, Damianos Gavalas

doi:10.1142/s021819401350023x

Abstract

This article presents a novel crawling and clustering method for extracting and processing cultural data from the web in a fully automated fashion. Our architecture relies on a focused web crawler to download web documents relevant to culture. A focused crawler is a web crawler that searches and processes only those web pages that are relevant to a particular topic. After downloading the pages, we extract from each document a number of words for each thematic cultural area and filter out the documents with non-cultural content; we then create multidimensional document vectors comprising the most frequent cultural term occurrences. We calculate the dissimilarity between the culture-related document vectors and, for each cultural theme, use cluster analysis to partition the documents into a number of clusters. Our approach is validated via a proof-of-concept application that analyzes hundreds of web pages spanning different cultural thematic areas.
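
The steps named in the abstract (term-count document vectors, filtering of non-cultural pages, a dissimilarity measure, and fuzzy partitioning) can be illustrated with a minimal sketch. The abstract does not give the paper's term lists, dissimilarity measure, or clustering algorithm, so everything below is an assumption made for illustration: the vocabulary `CULTURAL_TERMS` is hypothetical, Euclidean distance stands in for the dissimilarity, and textbook fuzzy c-means stands in for the authors' fuzzy clustering algorithm.

```python
import math
from collections import Counter

# Hypothetical term list; the paper's actual cultural vocabulary is not given here.
CULTURAL_TERMS = ["museum", "painting", "sculpture", "theatre", "opera", "festival"]


def document_vector(text, terms=CULTURAL_TERMS):
    """Count how often each term occurs in the text (naive whitespace tokenisation)."""
    counts = Counter(w.strip(".,;:!?()\"'").lower() for w in text.split())
    return [float(counts[t]) for t in terms]


def is_cultural(vector, min_hits=1):
    """Assumed filter: keep a document only if it contains at least min_hits terms."""
    return sum(vector) >= min_hits


def dissimilarity(u, v):
    """Euclidean distance, used here as a stand-in dissimilarity measure."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def fuzzy_c_means(vectors, c=2, m=2.0, iterations=30, eps=1e-9):
    """Textbook fuzzy c-means; returns (centers, memberships).

    memberships[i][k] is the degree to which document i belongs to cluster k.
    Centers are initialised deterministically from evenly spaced documents.
    """
    n, dim = len(vectors), len(vectors[0])
    centers = [list(vectors[i * n // c]) for i in range(c)]
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(iterations):
        # Membership update: closer centers receive higher membership.
        memberships = []
        for x in vectors:
            d = [dissimilarity(x, v) + eps for v in centers]
            memberships.append(
                [1.0 / sum((d[k] / d[j]) ** exponent for j in range(c)) for k in range(c)]
            )
        # Center update: mean of the documents weighted by membership ** m.
        for k in range(c):
            weights = [u[k] ** m for u in memberships]
            total = sum(weights)
            centers[k] = [
                sum(w * x[a] for w, x in zip(weights, vectors)) / total for a in range(dim)
            ]
    return centers, memberships


if __name__ == "__main__":
    pages = [
        "The museum shows a painting and a sculpture next to another painting",
        "A painting in the museum",
        "The opera festival and the theatre festival",
        "Opera at the theatre",
        "Stock prices fell sharply today",  # no cultural terms, so it is filtered out
    ]
    vectors = [v for v in map(document_vector, pages) if is_cultural(v)]
    _, memberships = fuzzy_c_means(vectors, c=2)
    for row in memberships:
        print([round(u, 3) for u in row])
```

Unlike a hard partition, each row of `memberships` sums to 1 and spreads a document's weight across clusters, which is the property that distinguishes fuzzy clustering from crisp cluster assignment.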

Journal: International Journal of Software Engineering and Knowledge Engineering
Publication Date: Aug 1, 2013
Citations: 42

Similar Papers

Design of improved focused web crawler by analyzing semantic nature of URL and anchor text
Prashant Dahiwale ... M M Raghuwanshi
01 Dec 2014

Googling for Health Information
Jennifer P D'Auria
Journal of Pediatric Health Care | VOL. 26
21 Jun 2012

Optimizing Crawler4j using MapReduce Programming Model
G M Siddesh ... B R Rakshitha
Journal of The Institution of Engineers (India): Series B | VOL. 98
12 Aug 2016

A Query based Approach to Reduce the Web Crawler Traffic using HTTP Get Request and Dynamic Web Page
Anurag Jain ... Shekhar Mishra
International Journal of Computer Applications | VOL. 14
12 Jan 2011
