Abstract

Computers play a major role in everyday life, and with this advancement in technology the volume of data collected from various fields keeps growing. Large amounts of data are produced every second and are not easy to process; such data is called Big Data. A large number of small files is also considered Big Data, and small files are difficult to process and store in Hadoop. Existing methods use merging and clustering techniques to combine small files into larger files of up to 128 MB before sending them to HDFS. The proposed system uses the CSFC (Clustering Small Files based on Centroid) clustering technique without specifying the number of clusters in advance, because a fixed number of clusters would force all files into that limited set. Instead, clusters are generated according to the number of related files in the dataset, and relevant files are combined within a cluster up to 128 MB. If a file is not relevant to any existing cluster, or if the cluster has reached 128 MB, a new cluster is created and the file is stored there. Related files are easier to process together than unrelated files, so fetching data from the data node with this method produces better results than other clustering techniques.
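A minimal sketch of the idea described above, written in Python for illustration only: each small file is assigned to the most similar existing cluster (measured against that cluster's centroid) provided the cluster stays under the 128 MB HDFS block limit; otherwise a new cluster is opened, so the number of clusters is never fixed in advance. The bag-of-words file representation, cosine similarity, and threshold value are assumptions, not the paper's actual implementation.

```python
import math
from collections import Counter
from dataclasses import dataclass, field

BLOCK_LIMIT = 128 * 1024 * 1024   # HDFS block size: 128 MB
SIMILARITY_THRESHOLD = 0.5        # assumed relevance cut-off (illustrative)

@dataclass
class SmallFile:
    name: str
    size: int        # size in bytes
    terms: Counter   # assumed bag-of-words representation of the file's content

@dataclass
class Cluster:
    centroid: Counter = field(default_factory=Counter)
    total_size: int = 0
    files: list = field(default_factory=list)

    def cosine(self, terms: Counter) -> float:
        # cosine similarity between a file and this cluster's centroid
        dot = sum(v * self.centroid.get(k, 0) for k, v in terms.items())
        na = math.sqrt(sum(v * v for v in terms.values()))
        nb = math.sqrt(sum(v * v for v in self.centroid.values()))
        return dot / (na * nb) if na and nb else 0.0

    def add(self, f: SmallFile) -> None:
        self.files.append(f)
        self.total_size += f.size
        # centroid kept as summed term counts; cosine against the sum equals
        # cosine against the mean, so this stands in for the true centroid
        self.centroid.update(f.terms)

def cluster_small_files(files: list[SmallFile]) -> list[Cluster]:
    clusters: list[Cluster] = []
    for f in files:
        # pick the most similar cluster that still has room under 128 MB
        best, best_sim = None, SIMILARITY_THRESHOLD
        for c in clusters:
            sim = c.cosine(f.terms)
            if sim >= best_sim and c.total_size + f.size <= BLOCK_LIMIT:
                best, best_sim = c, sim
        if best is None:
            # no relevant cluster, or no room left: open a new cluster
            best = Cluster()
            clusters.append(best)
        best.add(f)
    return clusters
```

The clusters returned by this sketch would then be merged into one large file each before being written to HDFS, so that each merged file fills (at most) a single 128 MB block.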
