Abstract

With the rapid advancement of the internet, we are now living in the era of big data. Image data on the web has the potential to support the development of sophisticated and robust models and algorithms for interacting with images and other multimedia data. Image datasets are widely used in image processing tasks and analyses, across fields including artificial intelligence, data extraction and collection, computer vision, research, and education. In this work, we propose a system that crawls the web in a systematic manner using the Hadoop MapReduce technique to collect images from millions of web pages. With celebrity images as just one use case, the system can search for and retrieve any image tagged with specific terms. It applies simple techniques to reduce noisy images such as thumbnails and icons. The proposed system is based on Apache Hadoop and Apache Nutch, an open-source web crawler. A customized crawl is run through Apache Nutch on a Hadoop cluster, searching the web for images in one or more categories and retrieving their links. Next, HIPI, the Hadoop Image Processing Interface, is used to download the images and create a dataset for an individual category or a dataset spanning multiple categories.
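The abstract does not include code, but the noise-reduction step lends itself to a short illustration. The sketch below, written against the standard Hadoop MapReduce Java API, shows one plausible way to filter thumbnails and icons out of the image links produced by the Nutch crawl: a mapper reads one image URL per input line, downloads the image, and keeps it only if its dimensions exceed a minimum threshold. The class name ImageFilterMapper and the MIN_WIDTH/MIN_HEIGHT values are illustrative assumptions, not taken from the paper.

import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import javax.imageio.ImageIO;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Hypothetical mapper sketch: each input line is one image URL
 * emitted by the Nutch crawl. The mapper downloads the image,
 * discards likely thumbnails and icons via a simple dimension
 * threshold, and emits the URLs of the accepted images.
 */
public class ImageFilterMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  private static final int MIN_WIDTH = 128;   // assumed threshold
  private static final int MIN_HEIGHT = 128;  // assumed threshold

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String url = value.toString().trim();
    if (url.isEmpty()) {
      return;
    }
    try (InputStream in = new URL(url).openStream()) {
      BufferedImage img = ImageIO.read(in);
      // ImageIO.read returns null for unreadable or unsupported data.
      if (img == null) {
        return;
      }
      // Simple noise reduction: drop small images, which removes
      // most thumbnails and icons.
      if (img.getWidth() >= MIN_WIDTH && img.getHeight() >= MIN_HEIGHT) {
        context.write(new Text(url), NullWritable.get());
      }
    } catch (IOException e) {
      // Skip dead links and network errors; a production job would
      // track these with a Hadoop counter instead of ignoring them.
    }
  }
}

The list of accepted URLs could then be handed to HIPI, whose bundled tools can download a list of image URLs into a HipiImageBundle (HIB), the on-disk dataset format the paper's pipeline produces per category.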

