Abstract

The effectiveness of automatic key concept (keyphrase) identification from unstructured text documents depends mainly on a comprehensive and meaningful list of candidate features extracted from those documents. However, conventional techniques for candidate feature extraction limit the performance of keyphrase identification algorithms and need improvement. This paper proposes a novel parse tree-based approach for candidate feature extraction to overcome the shortcomings of the existing techniques. The proposed technique generates a parse tree for each sentence in the input text. The sentence parse trees are then cut into sub-trees to extract branches for candidate phrases (e.g., noun and verb phrases). The sub-trees are combined using parts-of-speech tagging to generate a flat list of candidate phrases. Finally, filtering is performed using heuristic rules, and redundant phrases are eliminated to generate the final list of candidate features. The proposed scheme is validated experimentally on three manually annotated, publicly available data sets from different domains: Inspec, 500N-KPCrowd, and SemEval-2010. The technique is first tuned to determine the optimal value of the context window size parameter and is then compared with the conventional n-gram and noun-phrase-based techniques. The results show that the proposed technique outperforms the existing approaches, improving precision by 13.51% and 30.67%, recall by 12.86% and 5.48%, and F-measure by 13.16% and 31.46% over the noun-phrase-based and n-gram-based schemes, respectively. These results give us confidence to further validate the proposed technique by developing a keyphrase extraction algorithm in the future.
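
The pipeline described above (parse each sentence, cut the tree into sub-trees, flatten them into phrases, then filter and de-duplicate) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the parse tree is already available as a Penn-Treebank-style bracketed string, and every name in it (`parse_bracketed`, `candidate_phrases`, `STOPWORDS`, `max_words`) is hypothetical. The stopword trimming and the length limit stand in for the paper's heuristic rules, which the abstract does not specify, and `max_words` is not the context window size parameter.

```python
from typing import Iterator, List, Tuple, Union

# A tree node is (label, children); each child is another node or a word string.
Node = Tuple[str, list]

# Hypothetical stopword list, used only by the illustrative filter below.
STOPWORDS = {"the", "a", "an", "of", "is", "are"}


def parse_bracketed(text: str) -> Node:
    """Parse a bracketed parse string such as "(NP (NN text))" into nested tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read() -> Node:
        nonlocal pos
        label = tokens[pos + 1]  # tokens[pos] is the opening "("
        pos += 2
        children: list = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read()


def leaves(node: Union[Node, str]) -> List[str]:
    """Return the words under a node, left to right."""
    if isinstance(node, str):
        return [node]
    words: List[str] = []
    for child in node[1]:
        words.extend(leaves(child))
    return words


def subtrees(node: Union[Node, str]) -> Iterator[Node]:
    """Yield every sub-tree in pre-order, skipping bare words."""
    if isinstance(node, str):
        return
    yield node
    for child in node[1]:
        yield from subtrees(child)


def candidate_phrases(parse: str, labels=("NP", "VP"), max_words: int = 4) -> List[str]:
    """Collect de-duplicated phrases from sub-trees whose label is in `labels`."""
    seen, result = set(), []
    for sub in subtrees(parse_bracketed(parse)):
        if sub[0] not in labels:
            continue
        words = [w.lower() for w in leaves(sub)]
        # Illustrative heuristic: trim stopwords from both ends of the phrase.
        while words and words[0] in STOPWORDS:
            words.pop(0)
        while words and words[-1] in STOPWORDS:
            words.pop()
        phrase = " ".join(words)
        # Illustrative heuristic: drop empty, overlong, or repeated phrases.
        if not words or len(words) > max_words or phrase in seen:
            continue
        seen.add(phrase)
        result.append(phrase)
    return result


sentence = ("(S (NP (DT The) (JJ proposed) (NN technique)) "
            "(VP (VBZ extracts) (NP (JJ candidate) (NNS phrases))))")
print(candidate_phrases(sentence))
```

In practice the bracketed string would come from a constituency parser, and the filter would be replaced by whatever heuristic rules the full method defines.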
