Abstract

In this paper, a word-level n-gram based approach is proposed to find similarity between texts. The approach combines two separate and independent techniques: the self-organizing map (SOM) and text similarity measures. The SOM is distinctive in that its results of data clustering, as well as dimensionality reduction, are presented in a visual form. Four measures have been evaluated: cosine, Dice, extended Jaccard, and overlap. First of all, the texts have to be converted into a numerical representation. For that purpose, each text is split into word-level n-grams, from which a bag of n-grams is created. The n-gram frequencies are then calculated and a frequency matrix of the dataset is formed. Various filters are applied when creating the bag of n-grams: stemming algorithms, number and punctuation removal, stop-word removal, etc. All experimental investigation has been carried out using a corpus of plagiarized short answers.
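The conversion described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation; the preprocessing filters mentioned in the abstract (stemming, number/punctuation and stop-word removal) are omitted here for brevity, and simple whitespace tokenization is assumed.

```python
from collections import Counter

def word_ngrams(text, n):
    """Split a text into overlapping word-level n-grams."""
    words = text.lower().split()  # real pipelines would also filter/stem here
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def frequency_matrix(texts, n):
    """Build the bag of n-grams and a frequency matrix (one row per text)."""
    bags = [Counter(word_ngrams(t, n)) for t in texts]
    vocab = sorted(set().union(*bags))  # the shared bag of n-grams
    matrix = [[bag[g] for g in vocab] for bag in bags]
    return vocab, matrix

# Tiny example with bigrams (n = 2)
vocab, matrix = frequency_matrix(
    ["the cat sat on the mat", "the cat lay on the mat"], 2)
```

Each row of `matrix` is the numerical expression of one text; these rows are the vectors fed both to the SOM and to the similarity measures.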

Highlights

  • Text mining is used in many practical areas [1]; the most common are information extraction and retrieval, text classification and clustering, natural language processing, concept extraction, and web mining

  • All measures give the highest similarity score when the original text is compared to D17, which confirms that this text is a near copy

  • The approach is based on splitting the texts into n-grams and evaluating them using a self-organizing map (SOM) and similarity measures


Summary

Introduction

Nowadays, text mining is used in many practical areas [1], but the most common are information extraction and retrieval, text classification and clustering, natural language processing, concept extraction, and web mining. The main technique for detecting similarity between texts is to extract a bag of words from the whole text dataset. The advantage of the proposed method over other clustering methods is that it gives a visual representation of all texts in the dataset, their clusters, and their similarities, which allows decisions to be made much more quickly than by analyzing numerical estimates alone. We extract word-level n-grams of different lengths from the texts and analyze them. The experimental investigation was carried out using a corpus of plagiarized short answers.
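The four similarity measures named in the abstract can be sketched over the frequency vectors of two texts. This is a minimal sketch under common textbook definitions (vector cosine, set-based Dice and overlap on the n-grams present in each text, and the Tanimoto-style extended Jaccard on vectors); the paper's exact variants may differ.

```python
import math

def cosine(x, y):
    # dot(x, y) / (|x| * |y|)
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def dice(x, y):
    # 2 * |X ∩ Y| / (|X| + |Y|) over the sets of n-grams present in each text
    sx = {i for i, a in enumerate(x) if a}
    sy = {i for i, b in enumerate(y) if b}
    return 2 * len(sx & sy) / (len(sx) + len(sy))

def extended_jaccard(x, y):
    # dot(x, y) / (|x|^2 + |y|^2 - dot(x, y)), the vector generalization of Jaccard
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

def overlap(x, y):
    # |X ∩ Y| / min(|X|, |Y|) over the sets of n-grams present in each text
    sx = {i for i, a in enumerate(x) if a}
    sy = {i for i, b in enumerate(y) if b}
    return len(sx & sy) / min(len(sx), len(sy))
```

All four return 1.0 for identical vectors and 0.0 for vectors with no shared n-grams, which is why a near copy such as D17 scores highest under every measure.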

Proposed Approach to Evaluate Text Similarity
Preparation of Frequency Matrix
Self-Organizing Maps
Measures for Text Similarity Detection
Dataset
Steps of the Experiment
Experimental Results
Conclusions
