Novel self-learning based crawling and data mining for automatic information extraction

Arun Kumar A V, Hemant Kumar Rath, Shameemraj M Nadaf, Anantha Simha

doi:10.1109/icacci.2015.7275698

Abstract

In this paper, we propose techniques that use a novel combination of self-learning based crawling and rule-based data mining. The crawling techniques obtain smaller, domain-relevant data sets from the multi-dimensional data sets available in online as well as offline sources. We then process and mine the crawled data sets to extract meaningful information. Our techniques are generic and can be used for automatic information extraction in different domains such as biomedicine, health care, and enterprise infrastructure planning. Because the crawling is assisted and learning-based, the proposed schemes have reduced time, space, and processor complexity. The data mining is based on configurable classification rules and decision trees, which are scalable and easy to implement in practice. We evaluate the proposed techniques through a Java-based implementation and integration with NetDes, the TCS in-house enterprise network design tool.
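As a rough illustration of what "configurable classification rules" could look like in practice, the sketch below applies an ordered list of rules to crawled records, and the first matching rule decides the label. It is not the authors' Java implementation: the `Rule` class, the `classify` function, the record fields, and the sample rules are all hypothetical names chosen for this example, and Python is used only for brevity.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A crawled record is modelled as a flat mapping of field name to text.
# (Assumption: the paper does not specify its record format.)
Record = Dict[str, str]


@dataclass
class Rule:
    """One configurable classification rule: a label plus a match test."""
    label: str
    predicate: Callable[[Record], bool]


def classify(record: Record, rules: List[Rule], default: str = "unclassified") -> str:
    """Return the label of the first rule whose predicate matches the record."""
    for rule in rules:
        if rule.predicate(record):
            return rule.label
    return default


# Hypothetical rule set for a network-equipment domain; swapping this list
# for another one retargets the classifier without changing classify().
RULES = [
    Rule("router", lambda r: "router" in r.get("description", "").lower()),
    Rule("switch", lambda r: "switch" in r.get("description", "").lower()),
]

# Hypothetical crawled records.
crawled = [
    {"description": "24-port Ethernet switch"},
    {"description": "Edge Router, 4 WAN ports"},
    {"description": "Rack cabinet"},
]

labels = [classify(record, RULES) for record in crawled]
print(labels)
```

Keeping the rules as data rather than hard-coded branches is one simple way to make the classification configurable per domain; a decision tree would replace the flat list with nested tests.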

Full Text