Abstract

In the modern era of industrialization, an enormous amount of data (corpus) is generated every second, and not all of it is useful. When we crawl data from websites, social media, or newspapers, we collect a lot of repeated content, and storing it in a database makes the size of the stored data grow continually. To overcome this problem, we explain how fuzzy matching algorithms can be used to predict whether a piece of data is already available in a file or in storage. In a nutshell, this paper explains in detail how to use a comparison of different fuzzy algorithms for data matching, which helps reduce data storage and minimizes the time spent downloading unique data. Fuzzy matching applies when no exact match is found, that is, when the match is not 100% but is still related. This feature supports suggestions in search engines, spell checkers, and many other applications. According to CAGR predictions, the market size of data is expected to exceed 30 billion by 2025, and this is raw data that needs to be aligned for other NLP or AI processes used in business understanding, data understanding and modelling, data preparation, etc. The fuzzy matching approach, also called probabilistic record linkage, calculates the probability that various records, returned from the translation memory of the data storage or a file, mean the same or nearly the same as the input string provided by the user. Our paper discusses the proposed model, which applies the concept of fuzzy matching to make string searching easier in the English language; the same approach can be extended to all European and Indian languages.

Keywords: Fuzzy match, Levenshtein, Cosine, OCR, Boyer Moore, Soundex, Bitap, Jaro Winkler, Data Alignment
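
To illustrate the kind of duplicate check described above, the following is a minimal sketch of one of the listed techniques, Levenshtein edit distance, used to decide whether a newly crawled string is already present in storage. It is not the paper's implementation. The function names (`levenshtein`, `similarity`, `is_near_duplicate`), the lowercase normalization, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from typing import Iterable


def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum insertions, deletions, substitutions to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalize edit distance to a score in [0, 1]; 1.0 means identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def is_near_duplicate(candidate: str, stored: Iterable[str],
                      threshold: float = 0.85) -> bool:
    """Return True if candidate fuzzily matches any stored string.

    The threshold is an illustrative assumption, not a value from the paper.
    """
    normalized = candidate.strip().lower()
    return any(similarity(normalized, s.strip().lower()) >= threshold
               for s in stored)
```

A crawler could call `is_near_duplicate(new_text, existing_texts)` before saving a record and skip the record when it returns `True`. Comparing against every stored string is linear in the size of the store, so a larger system would likely need an index or a blocking step in front of this check.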
