Abstract

We propose a new keyphrase extraction method that combines statistical approaches with contextual embedding models. The method first identifies the main candidate keywords for a document based on their frequencies, then weights them by their similarity to the document. The frequency of each type of n-gram is calculated independently to ensure fair competition between n-gram lengths at the ranking stage. To assess the importance of a candidate keyword in a document, context-based embeddings are used to represent both the document and its candidate keywords, and the similarity between the two representations is measured. To prevent keyword length from biasing the similarity values, a normalization function adjusts the similarity score of each candidate according to its length. The final importance score of a keyword is the harmonic mean of its frequency and its similarity. The method is tested on five benchmark datasets against several state-of-the-art baselines, including TF-IDF, TextRank, and YAKE, and it outperforms all of them in terms of the MAP@K ranking metric.
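
The scoring pipeline described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract gives neither the exact frequency scaling nor the length-normalization function, so dividing each count by the largest count of the same n-gram size and dividing similarity by 1 + ln(number of words) are both hypothetical choices. `toy_embed` is a letter-count stand-in for a contextual embedding model, and all function names are invented for this example.

```python
import math
import re
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def normalized_frequencies(tokens, max_n=3):
    """Count each n-gram size separately and scale by that size's largest count.

    Assumption: the abstract only says frequencies are computed independently
    per n-gram type; scaling by the per-size maximum is one way to do that.
    """
    scores = {}
    for n in range(1, max_n + 1):
        counts = Counter(ngrams(tokens, n))
        if not counts:
            continue
        top = max(counts.values())
        for gram, count in counts.items():
            scores[gram] = count / top
    return scores


def toy_embed(text):
    """Stand-in embedding: a 26-dimensional letter-count vector (hypothetical)."""
    vec = [0.0] * 26
    for ch in text.lower():
        if "a" <= ch <= "z":
            vec[ord(ch) - ord("a")] += 1.0
    return vec


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def length_normalized_similarity(sim, n_words):
    """Hypothetical length normalization: damp similarity by 1 + ln(n_words)."""
    return sim / (1.0 + math.log(n_words))


def harmonic_mean(freq, sim):
    """Harmonic mean of two non-negative scores; 0.0 if either is zero."""
    return 2.0 * freq * sim / (freq + sim) if freq > 0 and sim > 0 else 0.0


def rank_keyphrases(text, embed=toy_embed, max_n=3, top_k=5):
    """Score every candidate n-gram and return the top_k (phrase, score) pairs."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    freqs = normalized_frequencies(tokens, max_n)
    doc_vec = embed(text)
    scored = []
    for gram, freq in freqs.items():
        sim = max(0.0, cosine(embed(gram), doc_vec))
        sim = length_normalized_similarity(sim, len(gram.split()))
        scored.append((gram, harmonic_mean(freq, sim)))
    # Sort by descending score, breaking ties alphabetically for determinism.
    scored.sort(key=lambda item: (-item[1], item[0]))
    return scored[:top_k]
```

The harmonic mean is small whenever either input is small, so a candidate has to be both frequent and similar to the document to rank highly. In practice, `embed` would be replaced by a call to a real contextual embedding model, and candidate filtering (for example stopword removal) would likely be added before scoring.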
