GPU-based similarity metrics computation and machine learning approaches for string similarity evaluation in large datasets

Aurel Baloi, Bogdan Belean, Flaviu Turcu, Daniel Peptenatu

doi:10.1007/s00500-023-08687-8

Abstract

The digital era brings, on the one hand, massive amounts of available data and, on the other hand, the need for parallel computing architectures for efficient data processing. String similarity evaluation is a processing task applied to large data volumes, commonly performed by various applications such as search engines, biomedical data analysis and even software tools for defending against viruses, spyware, or spam. String similarities are also used in the music industry for matching playlist records with repertory records composed of song titles, performing artists and producers' names, with the aim of ensuring copyright protection of mass-media broadcast materials. The present paper proposes a novel GPU-based approach for the parallel implementation of the Jaro–Winkler string similarity metric computation, broadly used for matching strings over large datasets. The proposed implementation is applied in the music industry for matching playlists of over 100k records against a given repertory which includes a collection of over 1 million rights-owner records. The global GPU memory is used to store multiple string lines representing repertory records, whereas the comparisons of a single playlist string with the raw data are performed using the maximum number of available GPU threads and stride operations. Furthermore, the accuracy of the Jaro–Winkler approach for the string matching procedure is increased using both an adaptive neural network approach guided by a novelty detection classifier (aNN) and a multiple-features neural network implementation (MF-NN). The aNN approach yielded an accuracy of 92%, while the MF-NN approach achieved an accuracy of 99% at the cost of increased computational complexity. Timing considerations and the computational complexity are detailed for the proposed approaches in comparison with both the general-purpose processor (CPU) implementation and the state-of-the-art GPU approaches. A speed-up factor of 21.6 was obtained for the GPU-based Jaro–Winkler implementation compared with the CPU one, whereas a factor of 3.72 was obtained compared with the existing GPU implementation of the string matching procedure based on the Levenshtein distance metric.
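To make the metric named in the abstract concrete, the following is a minimal CPU-side sketch of the textbook Jaro–Winkler similarity in Python, followed by a brute-force scan of one playlist string against a small repertory. It is not the authors' GPU kernel: the function names `jaro`, `jaro_winkler` and `best_match`, the scaling factor `p = 0.1`, the prefix cap of 4, and the sample titles are all assumptions chosen for illustration.

```python
from typing import List, Tuple


def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters count as matching only within this sliding window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Transpositions: matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    t = out_of_order // 2

    return (matches / len1 + matches / len2 + (matches - t) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, p: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted by the length of the common prefix.

    p and max_prefix are the conventional defaults, assumed here.
    """
    j = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return j + prefix * p * (1.0 - j)


def best_match(query: str, repertory: List[str]) -> Tuple[float, str]:
    """Hypothetical helper: score one playlist string against every
    repertory record and return the (score, record) pair with the top score."""
    return max((jaro_winkler(query, record), record) for record in repertory)


if __name__ == "__main__":
    # Illustrative data only, not taken from the paper's dataset.
    repertory = ["Bohemian Rhapsody", "Yesterday", "Imagine"]
    print(best_match("Bohemian Rapsody", repertory))
```

The loop inside `best_match` is the part that parallelizes naturally: each repertory record is scored independently, so on a GPU each thread could take a strided subset of the records, which is the general pattern the abstract describes.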

Journal: Soft Computing - A Fusion of Foundations, Methodologies and Applications
Publication Date: Jun 14, 2023
Citations: 2
License type: CC BY 4.0

Similar Papers

A String Similarity Evaluation for Healthcare Ontologies Alignment to HL7 FHIR Resources
Athanasios Kiourtis ... Argyro Mavrogiorgou
01 Jan 2019

Application of modified Levenshtein distance for classification of noisy business document images
Oleg Slavin ... Jianhong Zhou
05 Mar 2022

Learning to combine multiple string similarity metrics for effective toponym matching
Rui Santos ... Bruno Martins
International Journal of Digital Earth | VOL. 11
06 Sep 2017

Mmpp: A Package for Calculating Similarity and Distance Metrics for Simple and Marked Temporal Point Processes
Hideitsu Hino ... Ken Takano
The R journal | VOL. 7
01 Jan 2015
