This paper describes a Bayesian network approach to Information Retrieval (IR) from Web documents. The network structure provides an intuitive representation of uncertainty relationships, and the embedded conditional probability tables are used by inference algorithms to identify documents that are relevant to the user's needs, expressed in the form of Boolean queries. Our research has been directed toward constructing a probabilistic IR framework that focuses on assisting users in performing ad hoc retrieval of documents from various domains such as economics, news and sports. Furthermore, users can provide feedback on the relevance of the retrieved documents in order to improve performance on subsequent requests. Towards these goals, we have extended the traditional Bayesian network IR system and tested it on several Greek web corpora from different application domains. We have developed two approaches with regard to the network structure: a simple one, where the structure is provided manually, and an automated one, where data mining is used to extract the structure. In terms of precision-recall curves, the results show competitive performance against successful IR models of different theoretical backgrounds, such as the vector space model with tf-idf weighting and the probabilistic BM25 model. To further improve the performance of the IR system, we have implemented a novel similarity-based lemmatization framework, thus reducing the ambiguity posed by the plethora of morphological variations of the languages in question. The lemmatization framework comprises three core components (the word segregation, data cleansing and lemmatization modules) and is language-independent, i.e. it can be applied to other languages with morphological peculiarities and thus improve ad hoc retrieval. It maps an input word to its normalized form by employing two state-of-the-art, language-independent similarity models, namely the Levenshtein edit distance and the Dice coefficient, combined with a language model describing the most frequent inflectional suffixes of the examined language. Experimental results support our claim on the significance of incorporating this framework into web retrieval of Greek texts, as results improve by 4% to 11%.
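The similarity-based mapping described above can be illustrated with a minimal sketch. It is not the framework evaluated in the paper: the function names, the equal weighting of the two measures, the character-bigram form of the Dice coefficient, and the toy English lexicon and suffix list are all assumptions made for illustration only.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def bigrams(word: str) -> set:
    """Set of character bigrams (duplicates are ignored in this sketch)."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def dice(a: str, b: str) -> float:
    """Dice coefficient over character-bigram sets."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba and not bb:
        return 1.0 if a == b else 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))


def strip_suffix(word: str, suffixes: Iterable[str]) -> str:
    """Remove the longest matching suffix, keeping a stem of at least 2 chars."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[:-len(suffix)]
    return word


def lemmatize(word: str, lexicon: Iterable[str], suffixes: Iterable[str]) -> str:
    """Return the lexicon entry whose stem is most similar to the word's stem.

    The 0.5/0.5 weighting is a hypothetical choice, not taken from the paper.
    """
    suffixes = list(suffixes)
    stem = strip_suffix(word, suffixes)

    def score(lemma: str) -> float:
        lemma_stem = strip_suffix(lemma, suffixes)
        longest = max(len(stem), len(lemma_stem), 1)
        edit_similarity = 1 - levenshtein(stem, lemma_stem) / longest
        return 0.5 * edit_similarity + 0.5 * dice(stem, lemma_stem)

    return max(lexicon, key=score)


# Toy data: a hypothetical lexicon and suffix list, for illustration only.
LEXICON = ["walk", "talk", "wall"]
SUFFIXES = ["ing", "ed", "s"]

if __name__ == "__main__":
    for w in ("walking", "talks", "walked"):
        print(w, "->", lemmatize(w, LEXICON, SUFFIXES))
```

Because only the suffix list and the lexicon depend on the language, the same scoring could in principle be reused for another language by swapping those two inputs, which is the sense in which such a scheme is language-independent.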