Text mining: identification of similarity of text documents using hybrid similarity model

K M Shiva Prasad

doi:10.1007/s42044-022-00127-4

Abstract

The volume of data accessible on the internet has increased dramatically, and this growth is expected to continue exponentially as more data-exhaust devices are connected to the network. Part of this data consists of documents from various sources. As the data from digital sources grows, it becomes harder to identify the relevant information needed for further use. The goal of this research is to present a hybrid similarity algorithm that uses text summarization techniques to identify documents that are similar both semantically and contextually. Some of these methods aim to quantify the corpus's polysemy quotient, using deep learning with numerous layers and prebuilt Natural Language Processing (NLP) models to determine document similarity. In experiments comparing it with conventional algorithms, our model achieved an accuracy of 76.25%.

Full Text