Today, a historically unprecedented volume of data is publicly available with the potential to become useful for researchers. More than ever before, political parties and governments are releasing data such as speeches, legislative bills, and acts. As the volume of available data grows, however, so does the need for sophisticated tools for web harvesting and data analysis. Yet the researchers who develop these tools come, for the most part, from a computer science background, while researchers in the social and behavioral sciences who are interested in using such tools often lack the training to apply them themselves. To bridge these two communities, we propose a new tool called PolicyMiner. Its objective is twofold: first, to provide a general-purpose web-harvesting and data clean-up tool that researchers with limited technical backgrounds can use with relative ease; second, to implement knowledge discovery algorithms that can be applied to textual data such as legislative acts. This paper is a technical document detailing the data-processing steps implemented in PolicyMiner. First, PolicyMiner harvests raw HTML data from publicly available websites, such as governmental sites, and provides a unique integrated view of the data. Second, it cleans the data by removing irrelevant items such as HTML tags and non-informative terms. Third, it classifies the harvested data according to a pre-defined standard conceptual hierarchy based on the Eurovoc thesaurus. Fourth, it applies different knowledge discovery algorithms, such as time-series and correlation-based analysis, to capture the temporal and substantive policy dependencies of the textual data across countries.
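The clean-up step described above (removing HTML tags and non-informative terms) can be sketched in Python using only the standard library. This is a minimal illustration, not PolicyMiner's actual implementation: the stop-word list here is hypothetical, and the abstract does not specify which terms the tool treats as non-informative.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text content, skipping script/style elements."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = False

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = False

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)

# Hypothetical stop-word list for illustration only.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in"}

def clean_document(raw_html):
    """Strip HTML tags, then drop non-informative terms."""
    parser = TextExtractor()
    parser.feed(raw_html)
    text = " ".join(parser.parts)
    tokens = [t for t in text.split() if t.lower() not in STOPWORDS]
    return " ".join(tokens)
```

In practice, a production harvester would also need to handle malformed markup, character encodings, and site-specific page layouts before a cleaning pass like this one.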