The exponential growth of textual data poses a monumental challenge for extracting meaningful knowledge. Manually identifying descriptive keywords or keyphrases for each document is infeasible given the massive daily generated text. Automatic keyphrase extraction is, therefore, essential. However, current techniques struggle with learning the most salient semantic features from lengthy documents. This hybrid keyphrase extraction framework uniquely combines the complementary strengths of graph-based and textual feature methods. Our approach demonstrates improved performance over relying solely on statistical or graphical. Graph-based systems leverage word co- occurrence networks to score importance. Textual methods extract keyphrases using linguistic properties. Together, these complementary techniques overcome the limitations of relying on any strategy. The hybrid approach is evaluated on standard SemEval 2017 Task 10 and SemEval 2010 Task 5 benchmark datasets for scientific paper keyphrase extraction. Performance is quantified using the F1 score relative to human-annotated ground truth keyphrase. Results will quantify effectiveness on long documents with thousands of terms where only a few keywords represent salient concepts. Results show our technique effectively identifies the most salient semantic keywords, overcoming limitations of current techniques that struggle to mix features of graphical or statistical methods. Our experiments demonstrate that the proposed hybrid approach achieves superior F1 scores compared to current state-of-the-art methods on benchmark datasets. These results validate that synergistically combining graph and textual features enables more accurate keyphrase extraction, especially for long documents laden with extraneous terms.
Read full abstract