Identification of Similar Strings in a Dataset using Scalable Join

Archana S Vaidya, Khalid F. Alfatmi

doi:10.5120/21329-4295

Journal: International Journal of Computer Applications

Publication Date: Jun 18, 2015

Keywords: Role in Data Integration, Collections of Strings, Similarity Joins, MapReduce Concept, Data De-duplication, Record Linkage, Global Ordering, Data Integration, Large String, Data Cleansing

Abstract

Similarity join plays an important role in data integration and cleansing, record linkage, and data de-duplication. It finds similar string pairs in collections of strings. If two strings are similar, they share a common token. A number of approaches have been proposed for in-memory string similarity joins, but with the rise of big data, demand has grown for scalable algorithms that support large-scale string similarity joins. The proposed architecture uses the MapReduce concept and is based on an inverted index and multiple prefix filtering methods. The prefix filtering uses several different global orderings, which significantly reduces the number of candidate pairs and thus improves the pruning power compared with other approaches.
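
The idea of prefix filtering over an inverted index can be illustrated with a small in-memory sketch. This is not the authors' implementation: the abstract does not specify the similarity function, the tokenizer, or which global orderings are used, so the sketch assumes Jaccard similarity on whitespace tokens and two illustrative orderings (rarest token first, and alphabetical). The "map" and "reduce" steps are simulated with plain Python loops rather than a real MapReduce framework, and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity of two non-empty token sets."""
    return len(a & b) / len(a | b)


def prefix(tokens, rank, threshold):
    """Tokens of the set that fall in its prefix under one global ordering.

    For Jaccard threshold t and a set of n tokens, two similar sets must
    share at least one token among their first n - ceil(t * n) + 1 tokens.
    """
    ordered = sorted(tokens, key=lambda tok: rank[tok])
    keep = len(ordered) - math.ceil(threshold * len(ordered)) + 1
    return ordered[:keep]


def candidates(records, rank, threshold):
    """Candidate pairs that share a prefix token under one ordering."""
    # "Map" step (simulated): emit (prefix token, record id) into an inverted index.
    index = defaultdict(list)
    for rid, tokens in records.items():
        for tok in prefix(tokens, rank, threshold):
            index[tok].append(rid)
    # "Reduce" step (simulated): pair up the ids found in each posting list.
    pairs = set()
    for rids in index.values():
        pairs.update(combinations(sorted(rids), 2))
    return pairs


def similarity_join(strings, threshold):
    """Return sorted (i, j) index pairs whose Jaccard similarity >= threshold."""
    records = {i: set(s.lower().split()) for i, s in enumerate(strings)}
    freq = Counter(tok for toks in records.values() for tok in toks)
    # Assumed global orderings: rarest token first, and plain alphabetical.
    by_rarity = {t: r for r, t in enumerate(sorted(freq, key=lambda t: (freq[t], t)))}
    by_alpha = {t: r for r, t in enumerate(sorted(freq))}
    # A truly similar pair survives the filter under every ordering,
    # so intersecting the candidate sets only removes false candidates.
    cands = candidates(records, by_rarity, threshold) & candidates(records, by_alpha, threshold)
    # Verification step: compute the exact similarity for the survivors.
    return sorted(p for p in cands if jaccard(records[p[0]], records[p[1]]) >= threshold)


if __name__ == "__main__":
    sample = [
        "big data similarity join",
        "big data similarity joins",
        "scalable big data similarity join",
        "record linkage and cleansing",
    ]
    print(similarity_join(sample, 0.7))  # → [(0, 2)]
```

Intersecting the candidate sets is one plausible reading of "multiple prefix filtering"; each ordering alone is already a safe filter, so the intersection can only shrink the set of pairs that need exact verification.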
