Abstract
The ever-increasing availability of online textual data bases and the development of Internet have necessitated intensive research in the area of automatic text summarization within the Natural Language Processing (NLP) community. Researchers and students constantly face the problem when they work on a research project that, it is almost impossible to read most of the newly published papers. The goal of text summarization based on extraction approach is sentences selection. One of the methods to obtain the sentences is to assign some feature terms of sentences for the summary called ranking sentences and then select the best ones. Broad indexing and speedy search alone are not enough for effective retrieval. Categorized data are easy for user to browse if the data is well organized. In the first stage each document is prepared by preprocessing process: sentence segmentation, tokenization, stop word removal, case folding, lemmatization, and stemming. Then, we used important features, sentence filtering features, data compression features and finally calculate their score for each sentence. We proposed text summarization based on HMM tagger to improve the quality of the summary. By creating impressions the documents are also categorized. We compared our results with the Copernicus summarizer, Great summarizer and Microsoft Word 2007 summarizers etc. The proposed system is tested with four types' similarities: Cosine, Jaccard, Jaro-winkler and Sorenson similarities. The results show that the best quality for the summaries was obtained by feature terms method. Our text categorization approach is validated with Naive Bayesian, Decision Tree Induction, KNN and SVM approaches.
Talk to us
Join us for a 30 min session where you can share your feedback and ask us any queries you have
Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.