The rapid growth of patent applications in recent years offers an opportunity to uncover the underlying laws of innovation, but it also places higher demands on patent mining technology. An essential step in patent text mining is to build a technology portrait for each patent, that is, to identify the technical phrases that summarize and represent the patent from a technical point of view. A large body of existing work focuses on keyword extraction. However, technical phrase extraction differs from keyword extraction because of the unique properties of technical phrases: they must carry rich technical information and be essential to the entire patent text from a technical perspective. In addition, finding potential relationships between phrases with different technical meanings is difficult. Our analysis of the characteristics of technical phrases shows that both the position of technical phrases in the patent text and the structural relationships between them are crucial, and making good use of these two kinds of information is a challenge. Motivated by this, we propose UTESP, a new Unsupervised Technical phrase Extraction model built from the Structure and Position information perspective. UTESP consists of four key steps: candidate generation, graph construction, candidate scoring, and candidate selection. For structure information, UTESP adjusts the incoming edge weights of candidate phrases according to the distance relations between them and applies a graph ranking algorithm to obtain a structure score for each candidate phrase. For position information, it combines the position and frequency of each candidate phrase in the patent text to calculate a position score.
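The four steps can be illustrated with a minimal Python sketch. It is not the UTESP implementation: the abstract does not give the formulas, so every concrete choice here is an assumption made for illustration, including the stopword-based candidate generation, the `1/distance` edge weighting, the weighted PageRank used as the graph ranking algorithm, the reciprocal-position score, and the `alpha` mixing weight. All function names are hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical stopword list; a real system would more likely use POS patterns.
STOPWORDS = {"a", "an", "the", "of", "and", "or", "to", "in", "on", "for",
             "is", "are", "with", "by", "that", "which", "from", "at", "as", "it"}


def generate_candidates(text, max_len=3):
    """Step 1: return (phrase, start_position) for stopword-free runs of up to max_len words."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*|[.,;:!?()]", text.lower())
    occurrences, run, start = [], [], 0
    for i, tok in enumerate(tokens + ["."]):  # trailing sentinel flushes the last run
        if tok in STOPWORDS or not tok[0].isalpha():
            if run and len(run) <= max_len:
                occurrences.append((" ".join(run), start))
            run = []
        else:
            if not run:
                start = i
            run.append(tok)
    return occurrences


def build_graph(occurrences, window=10):
    """Step 2: weighted edges; closer co-occurrences contribute more (assumed 1/distance)."""
    weights = defaultdict(float)
    for i, (u, pos_u) in enumerate(occurrences):
        for v, pos_v in occurrences[i + 1:]:
            dist = pos_v - pos_u
            if dist > window:
                break  # occurrences are in text order, so later ones are even farther
            if u != v:
                weights[(u, v)] += 1.0 / dist
                weights[(v, u)] += 1.0 / dist
    return weights


def structure_scores(weights, damping=0.85, iterations=50):
    """Step 3a: weighted PageRank by power iteration (stand-in graph ranking algorithm)."""
    nodes = {n for edge in weights for n in edge}
    out_sum = defaultdict(float)
    for (u, _), w in weights.items():
        out_sum[u] += w
    scores = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for (u, v), w in weights.items():
            new[v] += damping * scores[u] * w / out_sum[u]
        scores = new
    return scores


def position_scores(occurrences):
    """Step 3b: earlier and more frequent phrases score higher (assumed sum of 1/(pos+1))."""
    scores = defaultdict(float)
    for phrase, pos in occurrences:
        scores[phrase] += 1.0 / (pos + 1)
    total = sum(scores.values())
    return {p: s / total for p, s in scores.items()}


def extract_phrases(text, top_k=5, alpha=0.5):
    """Step 4: mix both scores (assumed linear combination) and keep the top_k candidates."""
    occurrences = generate_candidates(text)
    struct = structure_scores(build_graph(occurrences))
    pos = position_scores(occurrences)
    combined = {p: alpha * struct.get(p, 0.0) + (1 - alpha) * pos[p] for p in pos}
    return sorted(combined, key=combined.get, reverse=True)[:top_k]
```

Calling `extract_phrases(patent_text, top_k=10)` returns the highest-scoring candidate phrases. Because no labeled data is involved at any step, the sketch stays unsupervised, which matches the setting described above.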
We demonstrate the effectiveness of our framework by comparing it with seven competitive algorithms on patent datasets in terms of three evaluation metrics: Precision, Recall, and F1 score. In addition, a comparison of Information Retrieval Efficiency (IRE) with the competitive algorithms indicates that our framework significantly improves the representation ability of the extracted technical phrases.
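For reference, Precision, Recall, and F1 for a set of extracted phrases can be computed as in the sketch below. It assumes exact string matching against a gold-standard phrase set, which may differ from the matching protocol used in the paper; the function name is hypothetical. IRE is not sketched because the abstract does not define it.

```python
def precision_recall_f1(predicted, gold):
    """Score extracted phrases against gold phrases using exact-match set overlap."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    # F1 is the harmonic mean of precision and recall; define it as 0 when there is no overlap.
    f1 = 2 * precision * recall / (precision + recall) if true_positives else 0.0
    return precision, recall, f1
```

With four predicted phrases of which two appear in a gold set of three, this gives a precision of 2/4, a recall of 2/3, and an F1 of 4/7.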