Abstract
Big data analytics and data mining are techniques used to analyze data and to extract hidden information. Traditional approaches to analysis and extraction do not work well for big data because this data is complex and of very high volume. A major data mining technique known as data clustering groups the data into clusters and makes it easy to extract information from these clusters. However, existing clustering algorithms, such as k-means and hierarchical clustering, are not efficient, and the quality of the clusters they produce is compromised. Therefore, there is a need to design an efficient and highly scalable clustering algorithm. In this paper, we put forward a new clustering algorithm called hybrid clustering in order to overcome the disadvantages of existing clustering algorithms. We compare the new hybrid algorithm with existing algorithms on the basis of precision, recall, F-measure, execution time, and accuracy of results. The experimental results show that the proposed hybrid clustering algorithm is more accurate and achieves better precision, recall, and F-measure values than the existing algorithms.
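The abstract does not state how precision, recall, and F-measure are computed for a clustering. As an illustration only, the sketch below uses the common pair-counting convention, in which a pair of points placed in the same cluster counts as a true positive when both points also share a ground-truth class. The function name `pairwise_scores` and the sample labels are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def pairwise_scores(true_labels, cluster_ids):
    """Pair-counting precision, recall, and F-measure for one clustering.

    Assumption: this is one common convention for scoring clusters,
    not necessarily the one used by the paper's authors.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label lists must have the same length")

    tp = fp = fn = 0
    # Examine every unordered pair of points exactly once.
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_ids[i] == cluster_ids[j]
        if same_cluster and same_class:
            tp += 1  # correctly grouped together
        elif same_cluster:
            fp += 1  # grouped together but from different classes
        elif same_class:
            fn += 1  # same class but split across clusters

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return precision, recall, f_measure


# Hypothetical toy data: five points, two true classes, two clusters.
scores = pairwise_scores(["a", "a", "a", "b", "b"], [0, 0, 1, 1, 1])
print(scores)
```

With this toy input there are two true-positive, two false-positive, and two false-negative pairs, so all three scores come out to 0.5. The pairwise loop is quadratic in the number of points, so on data of the scale discussed here it would have to be applied to a sample or replaced by a contingency-table computation.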
Highlights
Big data is currently generating a buzz in the market, and data is rapidly growing from being measured in gigabytes to terabytes, petabytes, and zettabytes[1]
We propose a new hybrid clustering technique that combines the workings of earlier clustering algorithms
To implement the proposed hybrid clustering technique in Hadoop[24], we chose a dataset from the National Climatic Data Center (NCDC), which maintains the world’s largest active archive of weather data[25]
Summary
Big data is currently generating a buzz in the market, and data is rapidly growing from being measured in gigabytes to terabytes, petabytes, and zettabytes[1]. Big data has such large data requirements that applications previously used to store and process data, such as the Database Management System (DBMS) and the Relational Database Management System (RDBMS), can no longer meet the demand[2]. Big data includes extremely large datasets, meaning that commonly used software tools cannot manage and process that data within the required time frame[3]. We propose a new hybrid clustering technique to handle big data.