Abstract

Data clustering methods have become standard techniques in the analysis of gene expression data. They are used in a variety of tasks, ranging from simple data pre-treatment for subsequent analysis to the identification of important information, such as gene function and/or the participation of a group of genes in a given biological process. They also save the biologist both money and the time that would be needed to obtain this type of information without the aid of intelligent computational methods. This work aims to guide the choices needed to obtain the best possible data clustering solution. To do so, algorithms from different approaches were used, namely the k-means and SOM algorithms, which belong to the unidimensional approach, and the SAMBA algorithm, which follows a bidimensional approach. Statistical and biological validation methods were employed to choose the best data clustering solution. The results presented here show that the statistical validation methods were rarely in agreement with the biological validation method. Furthermore, some advantages of the SOM algorithm over the k-means algorithm were observed. Use of the bidimensional algorithm SAMBA revealed dataset structure not identified by the unidimensional algorithms. It was also possible to associate meaningful biological information with genes of unknown function. All content of this work, including the clustering results and detailed analyses, is available at the URL http://www.ppgia.pucpr.br/~nievola/clusteranalysis.

