Combining Text Mining and Information Retrieval Techniques for Enhanced Access to Statistical Data on the Web: A Preliminary Report

Martin Rajman, Martin Vesely

doi:10.1007/3-540-32394-5_16

Abstract

In this contribution, we present the StatSearch prototype, a search engine that enables enhanced access to domain-specific data available on the Web. The StatSearch engine proposes a hybrid search interface combining query-based search with automated navigation through a tree-like hierarchical structure. The goal of such an interface is to allow more natural and intuitive control over the information access process, thus improving the speed and quality of access to information.

An algorithm for automated navigation is proposed that requires natural language pre-processing of the documents, including language identification, tokenization, Part-of-Speech (PoS) tagging, lemmatization and entity extraction. Structural transformation of the available data collection is also performed to reorganize the nodes in the information space (the Web site) from a graph into a tree-like hierarchical structure. This structural pre-processing can be done either by document clustering or, alternatively, by deriving the tree from the existing structure of the document collection, splitting, shifting, or merging nodes where necessary. The clustering approach is more straightforward but requires that the intermediate nodes in the created tree be assigned understandable descriptions, which is a difficult task.

Target documents are represented by weighted lexical profiles whose components are triples of the form (surface form, lemma, PoS). The extracted and normalized terms and entities are weighted using the TF.IDF weighting scheme. Document relevance is computed as the textual similarity between the query and document profiles. Several well-known similarity functions from the field of information retrieval have been tested, including the Cosine and Okapi BM25 similarity measures.
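As an illustration of the profile-based relevance computation described above, here is a minimal sketch of TF.IDF weighting combined with cosine similarity. It is not the StatSearch implementation: the abstract does not specify the exact TF.IDF variant, so the raw-count TF and logarithmic IDF used here are assumptions, profile components are reduced to plain lemma strings instead of (surface form, lemma, PoS) triples, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_profiles(docs):
    """Build TF.IDF-weighted profiles from {name: [normalized terms]}.

    Assumed variant: raw term count times log(N / document frequency).
    """
    n_docs = len(docs)
    doc_freq = Counter(t for terms in docs.values() for t in set(terms))
    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    profiles = {
        name: {t: tf * idf[t] for t, tf in Counter(terms).items()}
        for name, terms in docs.items()
    }
    return profiles, idf


def cosine_with_contributions(query, doc):
    """Return the cosine similarity and the share of it due to each query term."""
    q_norm = math.sqrt(sum(w * w for w in query.values()))
    d_norm = math.sqrt(sum(w * w for w in doc.values()))
    if q_norm == 0 or d_norm == 0:
        return 0.0, {}
    contributions = {
        t: query[t] * doc[t] / (q_norm * d_norm) for t in query if t in doc
    }
    return sum(contributions.values()), contributions


# Hypothetical toy collection (already tokenized and lemmatized).
docs = {
    "population": ["population", "growth", "sweden"],
    "labour": ["labour", "market", "sweden"],
    "prices": ["consumer", "price", "index"],
}
profiles, idf = tfidf_profiles(docs)

# The query profile is weighted with the same IDF values; unknown terms get 0.
query = {t: idf.get(t, 0.0) for t in ["population", "sweden"]}

results = {name: cosine_with_contributions(query, p) for name, p in profiles.items()}
for name, (score, contributions) in sorted(results.items(), key=lambda kv: -kv[1][0]):
    print(name, round(score, 3), contributions)
```

Because the cosine score is a normalized dot product, it decomposes into one additive term per shared query term, which is one simple way to report how much each query term contributed to a document's score.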
In addition to the similarity score, the contributions of all the query terms to the computed document similarities are also provided.

The principle of the presented algorithm for automated navigation is to compute a score distribution on the documents (leaves of the tree), and to propagate the obtained scores upwards in the tree. The node scores are then used to guide a faster, partially automatic, downward navigation in the tree. In particular, user intervention for node selection is only required at nodes whose children have a score distribution in which no clearly good candidate can be identified. Otherwise, the (possibly partial) traversal of the tree is performed automatically. Several approaches are compared for the automation of the navigation. They include decision rules based on relative or absolute minimum best-score differences, as well as on information-theoretic measures. The automated navigation algorithm also allows more reliable document ranking by letting the user restrict the search to the set of documents dominated by a specific node or to the documents matching a limited set of document types.

The presented hybrid search technique has been implemented in the StatSearch prototype, developed in collaboration between EPFL, Statistics Sweden (SCB), and CERN, in the framework of the NEMIS network of excellence. The prototype focuses on the domain of official statistics, and currently uses a database of over 5000 full-text documents, tables and graphs in English accessible at the SCB Web site.
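The score propagation and partially automatic descent described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' algorithm: the abstract does not say how leaf scores are aggregated upwards, so summing the children's scores is an assumption, only the relative best-score-difference rule is shown (not the absolute or information-theoretic variants), and `Node`, `propagate`, `auto_navigate`, the `min_rel_gap` threshold and the toy scores are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    score: float = 0.0  # for leaves: the document's similarity score
    children: list = field(default_factory=list)


def propagate(node):
    """Propagate leaf scores upwards (assumed aggregation: sum of children)."""
    if node.children:
        node.score = sum(propagate(child) for child in node.children)
    return node.score


def auto_navigate(node, min_rel_gap=0.5):
    """Descend while the best child clearly beats the runner-up.

    Returns the node where automatic navigation stops: either a leaf, or
    an internal node at which the user would have to pick a child.
    """
    while node.children:
        ranked = sorted(node.children, key=lambda c: c.score, reverse=True)
        best = ranked[0]
        runner_up = ranked[1].score if len(ranked) > 1 else 0.0
        if best.score <= 0:
            break  # nothing relevant below this node
        if (best.score - runner_up) / best.score < min_rel_gap:
            break  # no clearly good candidate: ask the user
        node = best
    return node


# Hypothetical tree with illustrative leaf (document) scores.
root = Node("root", children=[
    Node("Population", children=[
        Node("pop_growth", 0.73),
        Node("pop_density", 0.60),
    ]),
    Node("Labour", children=[Node("labour_market", 0.09)]),
    Node("Prices", children=[Node("cpi", 0.0)]),
])

propagate(root)
stop = auto_navigate(root)
print(stop.name)
```

In this toy tree the "Population" branch dominates at the root, so the first step is taken automatically, while its two documents score too closely for the default threshold and the choice is left to the user. A lower `min_rel_gap` makes the descent more aggressive.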

Full Text