Introduction

Web usage log files generated on web servers contain a huge amount of information suitable for applying data mining methods to discover potentially useful knowledge (Kosala & Blockeel, 2000; Wang, Li, & Zhang, 2005). Discovering web usage association rules is one of the popular data mining methods that can be applied to web usage log data. The information contained in association rules can be used to learn about website visitor behaviour patterns, to enhance the website structure so that it is more effective for visitors, or to improve web marketing campaigns (Anand, Mulvenna, & Chavielier, 2004; Cooley, Mobasher, & Srivastava, 1997).

Originally, association rule mining algorithms were applied to the analysis of transactional databases (Agrawal, Imielinski, & Swami, 1993; Brin, Motwani, Ullman, & Tsur, 1997). When evaluating the interestingness of association rules, various measures can be used to help find the rules that give maximally useful information to the user (Geng & Hamilton, 2006; Tan, Kumar, & Srivastava, 2004). The proposed association rule interestingness measures include all-confidence (Omiecinski, 2003), collective strength (Aggarwal & Yu, 1998), and conviction and lift (Brin et al., 1997).

While association rule mining algorithms are complete in that they find all rules that satisfy the defined constraints, they often produce a set of rules so large that it is difficult to identify those that are truly interesting to the user (Liu, Hsu, & Ma, 1999). This problem is aggravated in association rule mining of web usage log data (Huang, Cercone, & Aijun, 2002). Web usage data differs from market basket data in that it contains a large number of tightly correlated items (web resources or web pages) due to the link structure of a website.
Web pages that are tightly linked together often occur in the same visitor sessions, which is why the generated set of association rules contains a huge number of so-called hard association rules, which are not truly interesting to a user (Huang, 2007).

There are freely available data mining software systems that can be used for discovering association rules in web usage log data (Weka 3, n.d.). Weka is a capable data mining tool with a wide range of features (Witten, Frank, & Hall, 2011). We used Weka in our previous research (Dimitrijevic & Bosnjak, 2010), but ran into several obstacles. We found that Weka did not contain an integrated set of tools to support all phases of web usage mining. Separate software, such as the WUM Prep scripts, had to be used for web log preprocessing. Further, we had to develop our own special-purpose converters to turn the preprocessed web log data into Weka's ARFF file format before running the web usage association rule discovery algorithm in Weka. Another limitation of the current Weka version is that it offers four association rule interestingness measures with no option to combine them. Weka can be extended to include more interestingness measures, but this would involve additional work. Weka also has no support for automatic pruning of the discovered association rules using the methods we proposed (this was confirmed to us by the Weka architect in a forum).

We concluded that using our own independent tool, specialized for web usage mining, would give us more flexibility in our research and the ability to apply the methods we propose for pruning association rules and for combining various interestingness measures. We present the first version of this tool in this paper. We apply the system to the discovery of association rules on a real-life data set and present the results for various parameter values.
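To make the interestingness measures named above more concrete, the following is a minimal sketch of how lift, conviction and all-confidence could be computed for a single rule from a list of visitor sessions. It uses the standard textbook definitions of these measures and is not the implementation used in the tool presented in this paper. The function names, the session representation (a set of page URLs per session) and the example data are assumptions made for illustration only. Collective strength is omitted for brevity.

```python
from math import inf


def support(itemset, sessions):
    """Fraction of sessions that contain every page in `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for s in sessions if itemset <= s) / len(sessions)


def rule_measures(antecedent, consequent, sessions):
    """Standard interestingness measures for the rule antecedent -> consequent.

    Hypothetical helper: `sessions` is assumed to be a non-empty list of sets
    of page URLs, one set per visitor session.
    """
    a, b = frozenset(antecedent), frozenset(consequent)
    supp_a = support(a, sessions)
    supp_b = support(b, sessions)
    supp_ab = support(a | b, sessions)
    if supp_a == 0 or supp_b == 0:
        raise ValueError("antecedent and consequent must each occur in the data")

    confidence = supp_ab / supp_a
    lift = confidence / supp_b
    # Conviction is unbounded when the rule always holds (confidence == 1).
    conviction = inf if confidence == 1 else (1 - supp_b) / (1 - confidence)
    # All-confidence: support of the whole itemset over the largest item support.
    all_confidence = supp_ab / max(support({i}, sessions) for i in a | b)

    return {
        "support": supp_ab,
        "confidence": confidence,
        "lift": lift,
        "conviction": conviction,
        "all_confidence": all_confidence,
    }


# Illustrative sessions (invented data, not from the paper's data set).
example_sessions = [
    {"/", "/a"},
    {"/", "/b"},
    {"/", "/a", "/b"},
    {"/c"},
]
example_measures = rule_measures({"/a"}, {"/b"}, example_sessions)
```

In this invented example the pages "/a" and "/b" co-occur exactly as often as independence would predict, so the lift and conviction of the rule both equal 1.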
We propose a method to alleviate the problem of over-generation of not truly interesting rules in web usage mining by eliminating the rules that contain directly linked pages. …
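The idea of eliminating rules that contain directly linked pages can be sketched as follows. This is only one possible reading of the criterion, since the exact rule is not spelled out here: a rule is dropped when any page in its antecedent is directly linked, in either direction, to any page in its consequent. The function names, the representation of a rule as an (antecedent, consequent) pair of tuples, and the link set are all hypothetical.

```python
def is_directly_linked(rule, links):
    """True if some antecedent page and some consequent page share a direct link.

    Assumption: `links` is a set of (source_page, target_page) pairs taken
    from the website's link structure; direction is ignored here.
    """
    antecedent, consequent = rule
    return any(
        (a, c) in links or (c, a) in links
        for a in antecedent
        for c in consequent
    )


def prune_linked_rules(rules, links):
    """Keep only the rules whose two sides are not directly linked."""
    return [rule for rule in rules if not is_directly_linked(rule, links)]


# Illustrative data (invented): the home page links directly to "/a".
example_links = {("/", "/a")}
example_rules = [
    (("/",), ("/a",)),    # directly linked pages: pruned
    (("/a",), ("/b",)),   # no direct link: kept
]
pruned_rules = prune_linked_rules(example_rules, example_links)
```

Treating links as undirected is a design choice made for this sketch; a stricter variant could prune only when the link runs from the antecedent to the consequent.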