Abstract

In this paper, a word-level n-gram based approach is proposed to find similarity between texts. The approach combines two separate and independent techniques: the self-organizing map (SOM) and text similarity measures. The SOM is distinctive in that its results of data clustering, as well as dimensionality reduction, are presented in a visual form. Four measures have been evaluated: cosine, Dice, extended Jaccard, and overlap. First of all, the texts have to be converted into a numerical representation. For that purpose, each text is split into word-level n-grams, from which a bag of n-grams is created. The n-gram frequencies are then calculated and a frequency matrix of the dataset is formed. Various filters are applied when creating the bag of n-grams: stemming algorithms, number and punctuation removal, stop-word removal, etc. All experimental investigation has been carried out using a corpus of plagiarized short answers.
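The conversion described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation; the preprocessing filters mentioned in the abstract (stemming, number/punctuation and stop-word removal) are omitted here for brevity, and simple whitespace tokenization is assumed.

```python
from collections import Counter

def word_ngrams(text, n):
    """Split a text into overlapping word-level n-grams."""
    words = text.lower().split()  # real pipelines would also filter/stem here
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def frequency_matrix(texts, n):
    """Build the bag of n-grams and a frequency matrix (one row per text)."""
    bags = [Counter(word_ngrams(t, n)) for t in texts]
    vocab = sorted(set().union(*bags))  # the shared bag of n-grams
    matrix = [[bag[g] for g in vocab] for bag in bags]
    return vocab, matrix

# Tiny example with bigrams (n = 2)
vocab, matrix = frequency_matrix(
    ["the cat sat on the mat", "the cat lay on the mat"], 2)
```

Each row of `matrix` is the numerical expression of one text; these rows are the vectors fed both to the SOM and to the similarity measures.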

Highlights

  • Text mining is used in many practical areas [1]; the most common are information extraction and retrieval, text classification and clustering, natural language processing, concept extraction, and web mining

  • All measures give the highest similarity score when the original text is compared to D17, which confirms that this text is a near copy

  • The approach is based on splitting the texts into n-grams and evaluating them using a self-organizing map (SOM) and similarity measures


Summary

Introduction

Nowadays, text mining is used in many practical areas [1], but the most common are information extraction and retrieval, text classification and clustering, natural language processing, concept extraction, and web mining. The main technique for detecting similarity between texts is to extract a bag of words from the whole text dataset. The advantage of the proposed method over other clustering methods is that it gives a visual representation of all texts in the dataset, their clusters, and their similarities, which allows decisions to be made much more quickly than by analyzing numerical estimates alone. We extract word-level n-grams of different lengths from the texts and analyze them. The experimental investigation was carried out using a corpus of plagiarized short answers.
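The four similarity measures named in the abstract can be sketched over the frequency vectors of two texts. This is a minimal sketch under common textbook definitions (vector cosine, set-based Dice and overlap on the n-grams present in each text, and the Tanimoto-style extended Jaccard on vectors); the paper's exact variants may differ.

```python
import math

def cosine(x, y):
    # dot(x, y) / (|x| * |y|)
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))

def dice(x, y):
    # 2 * |X ∩ Y| / (|X| + |Y|) over the sets of n-grams present in each text
    sx = {i for i, a in enumerate(x) if a}
    sy = {i for i, b in enumerate(y) if b}
    return 2 * len(sx & sy) / (len(sx) + len(sy))

def extended_jaccard(x, y):
    # dot(x, y) / (|x|^2 + |y|^2 - dot(x, y)), the vector generalization of Jaccard
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sum(a * a for a in x) + sum(b * b for b in y) - dot)

def overlap(x, y):
    # |X ∩ Y| / min(|X|, |Y|) over the sets of n-grams present in each text
    sx = {i for i, a in enumerate(x) if a}
    sy = {i for i, b in enumerate(y) if b}
    return len(sx & sy) / min(len(sx), len(sy))
```

All four return 1.0 for identical vectors and 0.0 for vectors with no shared n-grams, which is why a near copy such as D17 scores highest under every measure.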

Proposed Approach to Evaluate Text Similarity
Preparation of Frequency Matrix
Self-Organizing Maps
Measures for Text Similarity Detection
Dataset
Steps of the Experiment
Experimental Results
Conclusions
