A hadoop based platform for natural language processing of web pages and documents

Paolo Nesi, Gianni Pantaleo, Gianmarco Sanesi

doi:10.1016/j.jvlc.2015.10.017

Open Access

https://doi.org/10.1016/j.jvlc.2015.10.017

Abstract

The rapid and pervasive spread of information through the web has made a huge amount of unstructured natural language text available. Over the last decade, great interest has arisen in discovering, accessing and sharing this vast source of knowledge. For this reason, processing very large data volumes in a reasonable time frame is becoming a major challenge and a crucial requirement for many commercial and research fields. Distributed systems, computer clusters and parallel computing paradigms have been increasingly applied in recent years, since they bring significant improvements in computing performance in data-intensive contexts such as Big Data mining and analysis. Natural Language Processing, and particularly the tasks of text annotation and key feature extraction, is an application area with high computational requirements; these tasks can therefore benefit significantly from parallel architectures. This paper presents a distributed framework for crawling web documents and running Natural Language Processing tasks in parallel. The system is based on the Apache Hadoop ecosystem and its parallel programming paradigm, MapReduce. Specifically, we implemented a MapReduce adaptation of a GATE application and framework (GATE is a widely used open source tool for text engineering and NLP). The solution is validated by using it to extract keywords and keyphrases from web documents on a multi-node Hadoop cluster. Performance scalability has been evaluated against a real corpus of web pages and documents.
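
As a rough illustration of the MapReduce paradigm mentioned in the abstract, the sketch below simulates the map, shuffle and reduce steps in a single Python process to rank candidate keywords by frequency. It is not the authors' implementation and does not use GATE or Hadoop: the stopword list, the function names (`map_document`, `shuffle`, `reduce_counts`, `top_keywords`) and the two sample pages are all assumptions made for the example, and plain term frequency stands in for whatever keyword scoring the paper actually applies.

```python
import re
from collections import defaultdict

# Hypothetical stopword list, for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "for", "is", "to", "on"}


def map_document(doc_id, text):
    """Map step: emit (candidate keyword, 1) for each non-stopword token."""
    for token in re.findall(r"[a-z]+", text.lower()):
        if token not in STOPWORDS and len(token) > 2:
            yield token, 1


def shuffle(pairs):
    """Shuffle step: group emitted values by key, as a framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: sum the counts emitted for one keyword."""
    return key, sum(values)


def top_keywords(corpus, k=3):
    """Run map, shuffle and reduce over the corpus and return the k most
    frequent keywords, ties broken alphabetically."""
    pairs = (
        pair
        for doc_id, text in corpus.items()
        for pair in map_document(doc_id, text)
    )
    reduced = [reduce_counts(key, values) for key, values in shuffle(pairs).items()]
    return sorted(reduced, key=lambda kv: (-kv[1], kv[0]))[:k]


# Made-up sample pages standing in for crawled web documents.
corpus = {
    "page1": "Hadoop runs MapReduce jobs on a cluster.",
    "page2": "MapReduce jobs process documents in parallel on Hadoop.",
}

if __name__ == "__main__":
    print(top_keywords(corpus, k=3))
```

On a real cluster the map and reduce functions would run on separate nodes and the grouping would be done by the framework; here everything runs sequentially so that the data flow is easy to follow.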

Journal: Journal of Visual Languages & Computing
Publication Date: Oct 30, 2015
Citations: 29
License type: other-oa
