Abstract
The rapid growth of data in information science and health informatics presents both challenges and opportunities, and it demands new approaches to data curation. This study evaluates feasible clustering methods within the Data Washing Machine (DWM), a tool designed to streamline unsupervised data curation. The DWM integrates Shannon entropy into its clustering process, so that the clustering strategy can be refined adaptively according to the entropy observed within data clusters. Testing of the DWM prototype on several annotated test samples gave promising results, particularly on high-quality data. Challenges arose with poor-quality data, which underlines the importance of data quality assessment and improvement for successful data curation. To enhance the DWM's clustering capabilities, the study also explored alternative unsupervised clustering methods, including spectral clustering, autoencoders, and density-based clustering such as DBSCAN, with the aim of handling diverse data scenarios more effectively. The findings demonstrate that constructing an unsupervised entity resolution engine with the DWM is practicable, and they highlight the role of Shannon entropy in improving unsupervised clustering for data curation. The study underscores the need for innovative clustering strategies and robust data quality assessment when working with modern data. The paper is organized into the following sections: Introduction, Methodology, Results, Discussion, and Conclusion.
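As a rough illustration of the entropy-guided refinement mentioned above, the sketch below computes the Shannon entropy $H = \sum_i p_i \log_2 (1/p_i)$ of the records in a cluster and flags clusters whose entropy exceeds a threshold. The abstract does not specify how the DWM computes or uses its entropy score, so the function names (`shannon_entropy`, `cluster_entropy`, `needs_refinement`), the normalization step, and the `threshold` value are all assumptions made for illustration only.

```python
import math
from collections import Counter


def shannon_entropy(values):
    """Shannon entropy (in bits) of the empirical distribution of `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # H = sum_i p_i * log2(1 / p_i)
    return sum((c / total) * math.log2(total / c) for c in counts.values())


def cluster_entropy(records):
    """Entropy of a cluster, taken over its normalized record strings.

    Assumption: records are plain strings, normalized here by lowercasing
    and collapsing whitespace. This is not necessarily the DWM's definition.
    """
    normalized = [" ".join(r.lower().split()) for r in records]
    return shannon_entropy(normalized)


def needs_refinement(records, threshold=0.5):
    """Hypothetical rule: re-cluster when entropy exceeds `threshold`."""
    return cluster_entropy(records) > threshold


clean_cluster = ["John Smith", "john  smith", "JOHN SMITH"]
mixed_cluster = ["John Smith", "Jon Smyth", "Mary Jones", "M. Jones"]

print(cluster_entropy(clean_cluster), needs_refinement(clean_cluster))
print(cluster_entropy(mixed_cluster), needs_refinement(mixed_cluster))
```

In this sketch a cluster whose records all normalize to the same string has zero entropy and is accepted, while a cluster of four distinct records has entropy 2 bits and would be sent back for stricter clustering.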