A patent keywords extraction method using TextRank model with prior public knowledge

Zhaoxin Huang, Zhenping Xie

doi:10.1007/s40747-021-00343-8

Abstract

For large collections of patent texts, extracting keywords in an unsupervised way is an important problem. Existing methods analyze only the information contained in the patent texts themselves. In this study, an improved TextRank model is proposed that makes use of prior public knowledge. Specifically, two networks are first constructed: (1) a TextRank network for each patent text, and (2) a prior knowledge network built from public dictionary data, in which edges represent the prior interpretation relationships among dictionary words within dictionary entries. Then, an improved node rank evaluation formula is designed for the TextRank networks of patent texts, into which the prior interpretation information from the prior knowledge network is introduced. Finally, patent keywords are extracted by selecting the top-k node words with the highest node rank values. In our experiments, a patent text clustering task is used to examine the performance of the proposed method, and several comparison experiments are carried out. The results demonstrate that the new method performs markedly better than existing methods on the unsupervised patent keyword extraction task.

Highlights

As the number of patent texts keeps growing, how to mine their contents to obtain valuable patent information has attracted widespread attention [1, 2].
An improved TextRank model, called PrTextRank in this text, is proposed in this study by introducing a prior knowledge network.
For the convolutional neural network (CNN), the 3000 patent texts in dataset I are split into training and test data at a ratio of 80% to 20%.

Summary

Introduction

As the number of patent texts keeps growing, how to mine their contents to obtain valuable patent information has attracted widespread attention [1, 2]. Patent contents can be well represented by a small set of key terms, called patent keywords. These keywords are widely used in text mining tasks such as automatic summary generation [3], patent novelty discovery [4], and text clustering and classification [5, 6]. Existing TextRank methods use the PageRank [9] formula to compute node rank values from the co-occurrence relationships among candidate words. They do not account for prior differences in the importance of term nodes that can be derived from public common knowledge. An improved TextRank model, called PrTextRank in this text, is proposed in this study by introducing a prior knowledge network.
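To make the baseline concrete, the following is a minimal sketch of TextRank-style keyword ranking on a word co-occurrence graph, with an optional per-word prior weight. It does not reproduce the PrTextRank formula, which is not given in this section. As a stand-in, the prior is injected through the teleport term of PageRank (personalized PageRank), and words missing from the prior are assumed to have a neutral weight of 1.0. The function names, the window size, and the damping factor of 0.85 are all illustrative assumptions.

```python
from collections import defaultdict


def build_cooccurrence_graph(tokens, window=2):
    """Build an undirected weighted graph linking words that occur
    within `window` positions of each other (window size is an assumption)."""
    graph = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                graph[word][other] += 1.0
                graph[other][word] += 1.0
    return graph


def textrank(graph, prior=None, damping=0.85, iterations=50):
    """Iterate a weighted PageRank update over the co-occurrence graph.

    `prior` is a hypothetical dict of word -> prior importance; words not
    listed get a neutral weight of 1.0. With prior=None this reduces to
    the usual uniform teleport term.
    """
    nodes = list(graph)
    if not nodes:
        return {}
    prior = prior or {}
    weights = {w: prior.get(w, 1.0) for w in nodes}
    total = sum(weights.values())
    teleport = {w: weights[w] / total for w in nodes}

    out_weight = {w: sum(graph[w].values()) for w in nodes}
    scores = {w: 1.0 / len(nodes) for w in nodes}
    for _ in range(iterations):
        updated = {}
        for w in nodes:
            # Each neighbour v passes on its score in proportion to edge weight.
            incoming = sum(graph[v][w] / out_weight[v] * scores[v] for v in graph[w])
            updated[w] = (1.0 - damping) * teleport[w] + damping * incoming
        scores = updated
    return scores


def top_k_keywords(tokens, k=3, prior=None, window=2):
    """Return the k words with the highest rank values."""
    scores = textrank(build_cooccurrence_graph(tokens, window), prior)
    return sorted(scores, key=scores.get, reverse=True)[:k]


if __name__ == "__main__":
    text = "patent keyword extraction method patent keyword network patent text"
    print(top_k_keywords(text.split(), k=3))
```

Passing a `prior` dictionary shifts rank mass toward the favoured words without changing the graph itself, which is one simple way to let outside knowledge influence a purely co-occurrence-based ranking.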

Methods

Results

Conclusion