Over the past decade, more and more Internet users have come to rely on search engines to find the information they need. However, the information they find depends to a large extent on the ranking mechanism of the search engine they use. Not surprisingly, the results generally include a large amount of information that is completely irrelevant. Text summarization is the process of reducing the size of a text while preserving its information content. It is an emerging technique for understanding the main purpose of any kind of document. When a large text document must be viewed in a short time and in a small area, such as a PDA screen, summarization provides greater flexibility and convenience. This research focuses on developing a statistical approach to automatic text summarization, based on the K-mixture probabilistic model, to enhance the quality of summaries. Sentences are ranked and extracted based on the significance values of their semantic relationships. The objective of this research is thus to propose a statistical approach to text summarization.

Keywords - Extraction, Keywords, Statistical approach, Text Summarization, Webpage.

I. INTRODUCTION

Finding the information that users need in a large amount of data is a major problem in information retrieval. A search engine is certainly a useful tool for helping Internet users find the information they need quickly. Unfortunately, its results generally include a great amount of information that is totally irrelevant. One of the problems is that useful information tends to be spread over a large number of similar documents instead of being located in a single document, and it is extremely difficult to identify and retrieve those documents. Building a web document summarization system involves research in dependence analysis of web documents, clustering, automatic summary generation, and user interfaces.
Most search engines use ranked lists to order the returned web pages by importance in response to a user query, so that the returned information is more relevant to what the user is looking for. However, the ranked lists are not summarized in terms of topics and are not suitable for browsing tasks, for a very simple reason: the returned information is not classified or categorized. In other words, the returned web pages are interleaved instead of being grouped by category. Thus, users waste a lot of time filtering out irrelevant data, even though search engine providers put a lot of time and effort into developing more useful ranking mechanisms.
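
The abstract describes ranking and extracting sentences with a K-mixture probabilistic model but does not give the scoring details. The following is a minimal sketch of one possible reading, not the method of this paper. The function k_mixture_prob follows the standard form of Katz's K-mixture for term counts. Everything else is an assumption made for illustration: sentences are treated as the text units, a term's significance is taken to be the surprisal of its observed count, and the names k_mixture_prob and rank_sentences are hypothetical.

```python
import math
import re
from collections import Counter


def k_mixture_prob(k, cf, df, n):
    """Probability that a term occurs k times in one text unit (Katz K-mixture).

    cf: total occurrences of the term, df: number of units containing it,
    n: total number of units.
    """
    if k == 0:
        return 1.0 - df / n
    lam = cf / n            # mean occurrences per unit
    beta = (cf - df) / df   # extra occurrences per unit that contains the term
    # Same value as (alpha / (beta + 1)) * (beta / (beta + 1)) ** k with
    # alpha = lam / beta, rearranged so that beta == 0 does not divide by zero.
    return lam * beta ** (k - 1) / (beta + 1) ** (k + 1)


def rank_sentences(text, top_n=2):
    """Return the top_n highest-scoring sentences in their original order."""
    # Assumption: sentences are the text units, split naively on . ! ?
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    counts = [Counter(re.findall(r"[a-z]+", s.lower())) for s in sentences]
    n = len(sentences)

    cf, df = Counter(), Counter()
    for c in counts:
        cf.update(c)         # add per-sentence term counts
        df.update(c.keys())  # count each term once per sentence

    scored = []
    for idx, c in enumerate(counts):
        # Assumption: term significance is the surprisal of its observed count.
        score = sum(-math.log2(k_mixture_prob(k, cf[t], df[t], n))
                    for t, k in c.items())
        scored.append((score, idx))

    top = sorted(scored, reverse=True)[:top_n]
    return [sentences[i] for _, i in sorted(top, key=lambda pair: pair[1])]
```

Calling rank_sentences(document_text, top_n=3) returns three sentences as an extractive summary. Summing surprisal favors long sentences with rare terms, so a practical system would likely normalize by sentence length and remove stop words; those steps are omitted here to keep the sketch short.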