Abstract

It has been estimated that the amount of information in the world doubles every twenty months. Today, corporate, governmental, and scientific communities are being overwhelmed by an influx of data, increasingly available in semi-structured and textual forms from the World Wide Web or via an intranet. Knowledge discovery, or data mining, is the emerging field that aims at analysing massive amounts of data and extracting meaningful, comprehensible patterns, called knowledge. This thesis examines the problems associated with knowledge discovery, focusing specifically on issues arising from the construction of a categorical classifier using distributed and textual data sources. Textual documents generally contain richer information, complementary to merely numerical data; they therefore provide better resources for data mining, provided that techniques exist that can fully exploit textual content. Potentially, text mining can extract more useful and relevant knowledge than traditional numeric data mining techniques. Any knowledge discovery process comprises at least four major steps: information selection, data preprocessing, mining, and interpretation of the findings. Selecting high-quality data sources and features is the first problem addressed in this study. The study introduces a notion of data quality that is independent of any particular classifier and serves as a proxy for how well a classification problem can be solved using the pertinent data sources. Experimental results show that this quality notion correlates positively with the accuracy of the corresponding solution. In the data preprocessing stage, we investigate ways of transforming unstructured textual data into structured feature weightings. In contrast to the methods presented in the literature, our transformation methods are specifically tailored towards solving categorical prediction problems.
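The abstract does not spell out the transformation it uses, but a common instance of turning unstructured text into structured feature weightings is TF-IDF. The following is a minimal sketch of that generic scheme, purely for illustration; the thesis's own transformation is tailored to categorical prediction, which this version is not, and the sample documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """Map raw documents to per-term feature weightings (generic TF-IDF sketch)."""
    tokenised = [doc.lower().split() for doc in documents]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenised for term in set(tokens))
    weights = []
    for tokens in tokenised:
        tf = Counter(tokens)
        weights.append({
            # term frequency within the document, scaled by inverse document frequency
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical headline-style documents, loosely in the spirit of financial news.
docs = [
    "index rises on strong bank earnings",
    "index falls as bank shares slide",
    "earnings season lifts the index",
]
features = tfidf_weights(docs)
```

A term such as "index", which occurs in every document, receives weight zero and thus carries no discriminating information, while terms confined to a few documents are weighted up.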
As far as mining is concerned, a common situation is that there are many data sources from which to mine knowledge. One possibility is to bring all the data sources together into a single large database; however, this is not only costly but often infeasible. A second possibility is to mine a single data source and assume that the knowledge found there applies to all the other sources as well; this, however, may bias the result. We therefore investigate a third way of mining knowledge: from each data source, only one rule is generated, and the individual rules are then brought together to represent adequately the knowledge of the complete data. Finally, we suggest and compare various ways of combining individual categorical predictions into a consensus prediction. This allows distributed prediction, that is, the generation of one opinion from each data source and the combination of all opinions to form the final result. Most of the experiments concern the prediction of the Hang Seng Index, Hong Kong's stock market index, based on financial news available from web sources such as the Wall Street Journal and CNN. In addition, some publicly available benchmark data sets are used to verify the proposed techniques. (Abstract shortened by UMI.)
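The simplest way to combine one categorical opinion per data source is an unweighted majority vote; the thesis compares several combination schemes, of which this is presumably only the most basic. A minimal sketch, with hypothetical example opinions:

```python
from collections import Counter

def consensus(predictions):
    """Combine one categorical prediction per data source by majority vote.

    An unweighted vote: the class predicted by the most sources wins.
    Ties are broken arbitrarily by Counter ordering.
    """
    votes = Counter(predictions)
    winner, _ = votes.most_common(1)[0]
    return winner

# One opinion per source, e.g. predicted direction of the Hang Seng Index.
opinions = ["up", "down", "up", "up", "down"]
final = consensus(opinions)  # three of the five sources say "up"
```

More refined schemes might weight each source's vote, for instance by the data-quality score introduced earlier, but the unweighted vote already illustrates the distributed-prediction idea.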
