In recent years, due to the explosive growth of patent applications, patent mining has drawn extensive attention and interest. An important issue of patent mining is that of recognizing the technologies contained in patents, which serves as a fundamental preparation for deeper analysis. To this end, in this article, we make a focused study on constructing a technology portrait for each patent, i.e., to recognize technical phrases concerned in it, which can summarize and represent patents from a technical perspective. Along this line, a critical challenge is how to analyze the unique characteristics of technical phrases and illustrate them with definite descriptions. Therefore, we first generate the detailed descriptions about the technical phrases existing in extensive patents based on different criteria, including various previous works, practical experience, and statistical analyses. Then, considering the unique characteristics of technical phrases and the complex structure of patent documents, such as multi-aspect semantics and multi-level relevances, we further propose a novel unsupervised model, namely TechPat, which can not only automatically recognize technical phrases from massive patents but also avoid the need for expensive human labeling. After that, we evaluate the extraction results from various aspects. Specifically, we propose a novel evaluation metric called Information Retrieval Efficiency (IRE) to quantify the performance of extracted technical phrases from a new perspective. Extensive experiments on real-world patent data demonstrate that the TechPat model can effectively discriminate technical phrases in patents and greatly outperform existing methods. We further apply extracted technical phrases to two practical application tasks, namely patent search and patent classification, where the experimental results confirm the wide application prospects of technical phrases. Finally, we discuss the generalization ability of our proposed methods.
Read full abstract