Development of an Embedding Framework for Clustering Scientific Papers

Songhee Kim,Suyeong Lee,Byungun Yoon

doi:10.1109/access.2022.3160826

Abstract

In this era, research and development are becoming a continuous and accelerating process because technology changes rapidly with a short lifecycle. As a result, various methodologies are being developed to monitor these rapidly changing research trends; In particular, clustering method-related studies in science and technology documents are being developed with a variety of approaches. However, previous studies on document clustering methods focus on a specific field or language but do not take into consideration certain important pieces of information in science and technology documents. Therefore, this study proposes an embedding methodology that uses important content from scientific and technical documents. We took into consideration the importance of information containing core structures in science and technology documents and proposed a clustering methodology that analyzes structured and unstructured data, such as textual information, author information, and citation information. The proposed method combines both textual and structural data from the paper, using a method that focuses on screening important information by sections in science and technology documents. Then, Girvan-Newman clustering and Louvain clustering models are applied to generate embedding vectors and show evaluation results through the clustering indices. As a practical example, we applied the proposed methodology using paper data from the field of hydrogen cell vehicles. The results of this study will be effective in identifying gaps in technology for new technological development, identifying technology trends, and presenting directional information for future technology development.

Full Text