Abstract

With the rapid advancement of the internet, we are now living in the era of big data. Image data on the web has the potential to support the development of sophisticated and robust models and algorithms for interacting with images and other multimedia data. Image datasets are widely used in image processing tasks and analyses, across fields including artificial intelligence, data extraction and collection, computer vision, research, and education. In this work, we propose a system that crawls the web in a systematic manner using the Hadoop MapReduce technique to collect images from millions of web pages. With celebrity images as just one use case, the system can search for and retrieve any image tagged with specific terms. It applies simple techniques to reduce noisy images such as thumbnails and icons. The proposed system is based on Apache Hadoop and Apache Nutch, an open-source web crawler. A customized crawl is run through Apache Nutch on a Hadoop cluster, searching the web for images in one or more categories and retrieving their links. Next, HIPI, the Hadoop Image Processing Interface, is used to download the images and create a dataset for an individual category or a dataset spanning multiple categories.
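The abstract does not include code, but the noise-reduction step lends itself to a short illustration. The sketch below, written against the standard Hadoop MapReduce Java API, shows one plausible way to filter thumbnails and icons out of the image links produced by the Nutch crawl: a mapper reads one image URL per input line, downloads the image, and keeps it only if its dimensions exceed a minimum threshold. The class name ImageFilterMapper and the MIN_WIDTH/MIN_HEIGHT values are illustrative assumptions, not taken from the paper.

import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import javax.imageio.ImageIO;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical mapper sketch: each input line is one image URL
 * emitted by the Nutch crawl. The mapper downloads the image,
 * discards likely thumbnails and icons via a simple dimension
 * threshold, and emits the URLs of the accepted images.
 */
public class ImageFilterMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  private static final int MIN_WIDTH = 128;   // assumed threshold
  private static final int MIN_HEIGHT = 128;  // assumed threshold

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String url = value.toString().trim();
    if (url.isEmpty()) {
      return;
    }
    try (InputStream in = new URL(url).openStream()) {
      BufferedImage img = ImageIO.read(in);
      // ImageIO.read returns null for unreadable or unsupported data.
      if (img == null) {
        return;
      }
      // Simple noise reduction: drop small images, which removes
      // most thumbnails and icons.
      if (img.getWidth() >= MIN_WIDTH && img.getHeight() >= MIN_HEIGHT) {
        context.write(new Text(url), NullWritable.get());
      }
    } catch (IOException e) {
      // Skip dead links and network errors; a production job would
      // track these with a Hadoop counter instead of ignoring them.
    }
  }
}

The list of accepted URLs could then be handed to HIPI, whose bundled tools can download a list of image URLs into a HipiImageBundle (HIB), the on-disk dataset format the paper's pipeline produces per category.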

