Combining natural language processing techniques and algorithms LSA, word2vec and WMD for technological forecasting and similarity analysis in patent documents

João Marcos De Rezende,Izabella Martins Da Costa Rodrigues,Leandro Colombi Resendo,Karin Satie Komati

doi:10.1080/09537325.2022.2110054

João Marcos De Rezende, Izabella Martins Da Costa Rodrigues + Show 2 more

https://doi.org/10.1080/09537325.2022.2110054

Copy DOI

Abstract

ABSTRACT Keyword search is the most ordinary tool in patent offices; however, for more advanced research, free software is not presented on their websites. Thus, this paper has the purpose to provide a data-mining framework for patent documents, linking the natural language processing techniques and data analysis algorithms. The system has two main goals: the analysis of technological prospection and the evaluation of similarities among patents through titles and abstracts. For numerical experiments, we used the base of the US Patent and Trademark Office, with over a million documents. Analysing patents about TFT-LCD, Flash Memory and PDA, from 2010 to 2018, with S-Curve it was observed that the last two technologies decline. Using a cloud of words, it was possible to see the phone’s evolution, from 2010 to 2015. To evaluate the degree of similarity among patents, we investigated Latent Semantic Analysis (LSA), Word2vec, Word Mover’s Distance (WMD), in three different study cases. In addition, these methods were compared with the classical Jaccard index. Numerical results show that LSA and WMD obtained similar patent indications, and the Jaccard index presented different indications from the other three.

Full Text