Abstract

Data profiling is the process of searching data for abnormalities that undermine its value. One data profiling technique, text clustering, can detect duplicated text within a single paragraph. This paper aims to implement a text clustering algorithm based on the fingerprint method as a transformation in Pentaho Data Integration (PDI). The implementation maps the steps of the fingerprint method onto components available in Pentaho, builds a transformation for each component, and then evaluates the resulting text clusters against those produced by an open source data profiling tool. The text clustering transformation was implemented successfully in Pentaho Data Integration, although some of its logic remains rudimentary. The transformation built with Pentaho Data Integration produced a greater number of text clusters than the OpenRefine tool.
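The abstract does not spell out the steps of the fingerprint method. As a minimal sketch, assuming the keying steps commonly described for OpenRefine's fingerprint clustering (trim, lowercase, fold accents, strip punctuation, then sort and de-duplicate tokens), the idea can be illustrated in Python. The function names `fingerprint` and `cluster_by_fingerprint` and the sample values are illustrative assumptions; this is not the PDI transformation built in the paper.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a normalized key so that near-duplicate strings collide."""
    text = value.strip().lower()
    # Fold accented characters to their base letters (assumed step).
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Drop ASCII punctuation only; other symbols are left untouched.
    text = "".join(ch for ch in text if ch not in string.punctuation)
    # Split on whitespace, remove duplicate tokens, and sort them.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster_by_fingerprint(values):
    """Group values that share a fingerprint; keep groups with variants."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    # A cluster is only interesting if it holds more than one spelling.
    return [group for group in groups.values() if len(set(group)) > 1]


if __name__ == "__main__":
    sample = ["Tom Cruise", "Cruise, Tom", "tom cruise.", "Brad Pitt"]
    for group in cluster_by_fingerprint(sample):
        print(group)
```

Because each step is a simple string operation, it is plausible to map each one onto a separate transformation component, which is the approach the abstract describes; differences in how individual steps are realized are one possible reason two tools could report different cluster counts.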
