World Wide Web (WWW) which is predominant source for Information Retrieval today (IR) is essentially a set of hyperlinked documents. A web page containing more number of related hyperlinks satisfy the user needs in a single page. The IR systems should give high priority to such web pages. While assigning a rank for a web page, existing web mining techniques such as Hypertext Induced Topic Selection (HITS) and Page Ranking algorithms focus on the number of in links and out links present in the web page. Instead of just relying on the number of links present in the web page, the discovery of semantic relations between the web page and the hyperlinks present in the web page can improve the quality of the IR systems. The Rhetorical Structure Theory (RST) is widely used to find the semantic relations between text fragments by analysing the discourse structure of a text. In this paper, we propose a novel approach to find the semantic relation between a web page and the links present in the web page using RST. The proposed approach uses RST based discourse relations to find the relation between a web page and the hyperlinks present in the web page. We have implemented and evaluated our approach on an IR system using 500 Tamil language and 50 English tourism domain specific web pages. A comparison between the proposed approach and an existing page ranking algorithm has also been done.
Read full abstract