Abstract

Background: During the last decade, the use of microarrays to assess the transcriptome of many biological systems has generated an enormous amount of data. A common technique used to organize and analyze microarray data is cluster analysis. While many clustering algorithms have been developed, they all suffer a significant decrease in computational performance as the dataset being analyzed becomes very large. For example, clustering 10,000 genes from an experiment containing 200 microarrays can be time consuming and challenging on a desktop PC. One solution to the scalability problem of clustering algorithms is to distribute or parallelize the algorithm across multiple computers.

Results: The software described in this paper is a high-performance multithreaded application that implements a parallelized version of the k-means clustering algorithm. Most parallel processing applications are not accessible to the general public and require specialized software libraries (e.g. MPI) and specialized hardware configurations. The parallel nature of the application comes from the use of a web service to perform the distance calculations and cluster assignments. Here we show that our parallel implementation provides significant performance gains over a wide range of datasets using as few as seven nodes. The software was written in C# and was designed in a modular fashion to provide flexibility in both deployment and the user interface.

Conclusion: ParaKMeans was designed to provide the general scientific community with an easy and manageable client-server application that can be installed on a wide variety of Windows operating systems.

Highlights

  • During the last decade, the use of microarrays to assess the transcriptome of many biological systems has generated an enormous amount of data

  • Clustering algorithms are used in various fields such as computer graphics, statistics, data mining and biomedical research

  • A serial k-means algorithm has a complexity of N*k*R, where k is the number of clusters, R is the number of iterations and N is the number of arrays


Summary

Introduction

The use of microarrays to assess the transcriptome of many biological systems has generated an enormous amount of data. A common technique used to organize and analyze microarray data is cluster analysis. Data clustering is the process of partitioning a dataset into separate groups ("clusters") containing "similar" data items based on some distance function; it does not require a priori knowledge of the groups to which data members belong. The application of high-throughput technologies, e.g. microarrays, in biomedical research generates an enormous amount of high-dimensional data. The k-means algorithm, introduced by J.B. MacQueen in 1967, is one of the more popular partitioning methods. This algorithm partitions the data into k groups, assigning each item to the cluster whose mean is nearest. A serial k-means algorithm has a complexity of N*k*R, where k is the number of clusters, R is the number of iterations and N is the number of arrays.
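To make the N*k*R cost concrete, the following is a minimal sketch of a serial k-means loop in Python. It is not the ParaKMeans implementation (which the abstract states was written in C#). The function names, the random choice of initial centroids, the squared Euclidean distance and the `distance_evals` counter are all assumptions made for illustration. Here N is taken to be the number of items being clustered, and each distance calculation is counted as one unit of work.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, max_iter=100, seed=0):
    """Serial k-means sketch (illustrative only, not the ParaKMeans code).

    Returns (assignments, centroids, iterations, distance_evals).
    """
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct items drawn at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = [-1] * len(points)
    distance_evals = 0
    iterations = 0

    for _ in range(max_iter):
        iterations += 1
        changed = False

        # Assignment step: every item is compared with every centroid,
        # so one pass costs N * k distance evaluations.
        for i, p in enumerate(points):
            best = min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            distance_evals += k
            if best != assignments[i]:
                assignments[i] = best
                changed = True

        if not changed:
            break  # converged: no item changed cluster in this pass

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assignments[i] == c]
            if members:  # leave an empty cluster's centroid where it is
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]

    return assignments, centroids, iterations, distance_evals


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    labels, centers, r, evals = kmeans(data, k=2)
    # The number of distance evaluations equals N * k * R by construction.
    print(labels, r, evals == len(data) * 2 * r)
```

The assignment step is the part that dominates this count, and it is independent for each item, which is why the abstract describes distributing the distance calculations and cluster assignments across nodes.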

