Abstract

Data profiling is the process of searching data for abnormalities that undermine its value. One data profiling technique, text clustering, can detect duplicated text within a single paragraph. This paper aims to implement a text clustering algorithm based on the fingerprint method as a transformation in Pentaho Data Integration (PDI). The method is to map the steps of the fingerprint algorithm onto the components available in Pentaho, build a transformation for each component, and then evaluate the resulting text clusters against an open source data profiling tool. The text clustering implementation in Pentaho Data Integration was completed successfully, although some of its logic remains rudimentary. The transformation built in Pentaho Data Integration produced a greater number of text clusters than the OpenRefine tool.
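To make the fingerprint method concrete, the following is a minimal Python sketch of key-collision clustering. It is modeled on the fingerprint keying steps documented for OpenRefine, not on the PDI transformation described in this paper. The function names `fingerprint` and `cluster` and the sample values are hypothetical, chosen for illustration only.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key; values sharing a key are treated as duplicates."""
    # Trim surrounding whitespace and lowercase.
    text = value.strip().lower()
    # Decompose accented characters and drop the combining marks (e.g. "é" -> "e").
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation and control/format characters.
    text = "".join(
        ch
        for ch in text
        if ch not in string.punctuation and unicodedata.category(ch)[0] != "C"
    )
    # Split on whitespace, deduplicate tokens, sort them, and rejoin.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values by fingerprint; keep only groups with more than one distinct value."""
    groups = defaultdict(set)
    for v in values:
        groups[fingerprint(v)].add(v)
    return [sorted(g) for g in groups.values() if len(g) > 1]


if __name__ == "__main__":
    # Hypothetical sample data, not taken from the paper.
    sample = ["Tom Cruise", "Cruise, Tom", "tom cruise.", "Brad Pitt"]
    print(cluster(sample))
```

Because tokens are sorted and deduplicated, word order and repeated words do not affect the key. Small differences in any normalization step (for example, how punctuation or accents are handled) change which values collide, which is one plausible reason two implementations can report different cluster counts on the same data.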

