Abstract

Focused crawling has a wide range of applications in Information Retrieval. It is a crucial part of building domain-specific search engines and personalized search tools, and of extending digital libraries. Whether it is Google Scholar for scholarly articles or Google News for news articles, domain-specific search is the most widely acclaimed application of focused crawling. Unfortunately, very few domain-specific search engines are available for Indian languages. Sandhan is one such project; it offers domain-specific search for the tourism and health domains across 10 major Indian languages. The amount of Indian-language content on the web is small compared to that in other languages. When we restrict the search space to a specific domain (say, tourism), the probability of finding relevant pages drops further. Hence recall plays a major role in such a scenario. Because Indian-language web pages tend to link to pages in other languages, usually English, traditional crawling methods with well-chosen seeds end up crawling a lot of unnecessary content. This means that gaining a little recall requires sacrificing precision and a lot of resources. In this work we explore ways of gathering Indian-language tourism and health pages from the web for Sandhan using a language- and domain-specific focused crawler. With this setup we crawl the web extensively for Indian-language tourism and health pages. We evaluate the quality of our crawl using three metrics: precision, recall, and harvest ratio. Using our approach we save nearly 80% of resources (disk space, bandwidth, processing time) while maintaining a recall of 0.74 and 0.58 for the tourism and health domains, respectively.
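
The abstract does not define the three metrics, so the following is only a minimal sketch under commonly used definitions, which may differ from those in the full paper. It assumes precision and recall are measured over the pages the crawler keeps, and harvest ratio over all pages it downloads. The function name `crawl_metrics` and the arguments `fetched`, `kept`, and `relevant` are hypothetical names introduced for illustration.

```python
def crawl_metrics(fetched, kept, relevant):
    """Compute precision, recall, and harvest ratio for a crawl.

    Assumed definitions (not taken from the paper):
      precision     = relevant pages kept / pages kept
      recall        = relevant pages kept / all relevant pages
      harvest ratio = relevant pages fetched / pages fetched

    fetched:  URLs the crawler downloaded
    kept:     URLs the crawler judged relevant and stored
    relevant: URLs that are truly relevant (ground truth)
    """
    fetched, kept, relevant = set(fetched), set(kept), set(relevant)
    kept_relevant = kept & relevant

    # Guard against empty sets to avoid division by zero.
    precision = len(kept_relevant) / len(kept) if kept else 0.0
    recall = len(kept_relevant) / len(relevant) if relevant else 0.0
    harvest_ratio = len(fetched & relevant) / len(fetched) if fetched else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "harvest_ratio": harvest_ratio,
    }


if __name__ == "__main__":
    # Toy data: five downloaded pages, three kept, four truly relevant.
    metrics = crawl_metrics(
        fetched=["a", "b", "c", "d", "e"],
        kept=["a", "b", "c"],
        relevant=["a", "b", "e", "f"],
    )
    print(metrics)
```

Under these assumed definitions, a crawler that follows many links to off-language or off-domain pages can raise recall slightly while its harvest ratio falls, which is the trade-off the abstract describes.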
