Analyzing semantic similarity amongst textual documents to suggest near duplicates

Viji Devarajan, Revathy Subramanian

doi:10.11591/ijeecs.v25.i3.pp1703-1711

Abstract

Data deduplication techniques remove repeated or redundant data from storage. As more data is generated and stored, more redundant and semantically similar content occupies the storage environment, which reduces storage efficiency and raises storage cost. To overcome this problem, we propose hybrid bidirectional encoder representation from transformers (BERT) for text semantics using graph convolutional network (HBTSG), a word embedding-based deep learning model that identifies near duplicates based on the semantic relationship between text documents. In this paper we hybridize the concepts of chunking and semantic analysis. The chunking process splits the documents into blocks. In the next stage, we identify the semantic relationship between documents using word embedding techniques. The method combines the advantages of chunking, feature extraction, and semantic relations to provide better results.
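The two stages named in the abstract, chunking followed by an embedding-based comparison, can be illustrated with a minimal sketch. This is not the HBTSG model: the function names, the block size, the 0.8 threshold, and the block-matching score are all assumptions made for illustration, and the `embed` function is a plain term-count vector standing in for BERT-derived embeddings, so it captures lexical overlap rather than true semantic similarity.

```python
import math
import re
from collections import Counter


def chunk(text, size=8):
    """Split a document into blocks of at most `size` lowercase word tokens."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def embed(block):
    """Stand-in embedding (assumption): sparse term counts, NOT a BERT vector."""
    return Counter(block)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def document_similarity(doc_a, doc_b, size=8):
    """Average, over the blocks of doc_a, of the best-matching block in doc_b."""
    blocks_a = [embed(b) for b in chunk(doc_a, size)]
    blocks_b = [embed(b) for b in chunk(doc_b, size)]
    if not blocks_a or not blocks_b:
        return 0.0
    best = [max(cosine(a, b) for b in blocks_b) for a in blocks_a]
    return sum(best) / len(best)


def is_near_duplicate(doc_a, doc_b, threshold=0.8, size=8):
    """Flag a pair as near duplicates when the score reaches the threshold."""
    return document_similarity(doc_a, doc_b, size) >= threshold
```

In a fuller pipeline, `embed` would be replaced by vectors from a pretrained language model, which is where the semantic (rather than purely lexical) comparison would come from. The score is asymmetric by design: it asks how well the blocks of the first document are covered by the second.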

Journal: Indonesian Journal of Electrical Engineering and Computer Science
Publication Date: Mar 1, 2022
Citations: 2
License type: CC BY-NC 4.0

Similar Papers

Bidirectional encoders to state-of-the-art: a review of BERT and its transformative impact on natural language processing
Rajesh Gupta
Информатика. Экономика. Управление - Informatics. Economics. Management | VOL. 3
02 Mar 2024

Semantic Textual Similarity in Japanese Clinical Domain Texts Using BERT.
Shoko Wakamiya ... Faith Wavinya Mutinda
Methods of Information in Medicine | VOL. 60
01 Jun 2021

Refining Semantic Similarity of Paraphasias Using a Contextual Language Model.
Alexandra C Salem ... Steven Bedrick
Journal of speech, language, and hearing research : JSLHR | VOL. 66
09 Dec 2022

Oversampling effect in pretraining for bidirectional encoder representations from transformers (BERT) to localize medical BERT and enhance biomedical BERT
Shoya Wada ... Yasushi Matsumura
Artificial Intelligence In Medicine | VOL. 153
05 May 2024
