A Hybrid Distance Measure for Clustering Expressed Sequence Tags Originating from the Same Gene Family

Keng-Hoong Ng, Chin-Kuan Ho, Somnuk Phon-Amnuaisuk

doi:10.1371/journal.pone.0047216

Open Access

Journal: PLoS ONE
Publication Date: Oct 11, 2012
Citations: 5
License type: CC BY 4.0

Affiliation: Multimedia University, Universiti Tunku Abdul Rahman

Abstract

Background

Clustering is a key step in the processing of Expressed Sequence Tags (ESTs). Its primary goal is to put ESTs from the same transcript of a single gene into a unique cluster. Recent EST clustering algorithms mostly adopt alignment-free distance measures, which tend to yield acceptable clustering accuracy in reasonable computational time. Although these clustering methods work satisfactorily on the majority of EST datasets, they share a common weakness: they are prone to deliver unsatisfactory clustering results when dealing with ESTs from genes that belong to the same family. The root cause is that the distance measures they apply are not sensitive enough to separate these closely related genes.

Methodology/Principal Findings

We propose a hybrid distance measure that combines global and local features extracted from ESTs, with the aim of addressing the clustering problem faced by ESTs derived from the same gene family. The clustering process is implemented using the DBSCAN algorithm. We test the hybrid distance measure on ten EST datasets, and the clustering results are compared with those of two alignment-free EST clustering tools, wcd and PEACE. The results indicate that the proposed hybrid distance measure achieves better clustering accuracy than both EST clustering tools.

Conclusions/Significance

The clustering results support the effectiveness of the proposed hybrid distance measure in solving the clustering problem for ESTs that originate from the same gene family. The improvement in clustering accuracy on the experimental datasets supports the claim that the hybrid distance measure is sensitive enough to solve this clustering problem.
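The abstract does not say which global and local features the hybrid measure combines, so the following is only a rough sketch of the general idea, not the authors' method. It assumes, purely for illustration, a global distance based on cosine similarity of 3-mer counts, a local distance based on the longest common substring, an equal weighting of the two, and a small hand-written DBSCAN. All function names and the parameters `k`, `weight`, `eps` and `min_pts` are hypothetical, and the sequences are toy data.

```python
from collections import Counter
from math import sqrt

def kmer_profile(seq, k=3):
    """Count every overlapping k-mer in the sequence (assumed global feature)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def global_distance(a, b, k=3):
    """1 - cosine similarity of the k-mer count vectors, clamped to [0, 1]."""
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    dot = sum(count * pb[word] for word, count in pa.items())
    norm = sqrt(sum(v * v for v in pa.values())) * sqrt(sum(v * v for v in pb.values()))
    if norm == 0:
        return 1.0
    return min(1.0, max(0.0, 1.0 - dot / norm))

def local_distance(a, b):
    """1 - (longest common substring length / shorter length) (assumed local feature)."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        cur = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, 1):
            if ch_a == ch_b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    shorter = min(len(a), len(b))
    return 1.0 - best / shorter if shorter else 1.0

def hybrid_distance(a, b, weight=0.5):
    """Weighted sum of the global and local distances (the weighting is an assumption)."""
    return weight * global_distance(a, b) + (1 - weight) * local_distance(a, b)

def dbscan(items, dist, eps, min_pts):
    """Plain DBSCAN over a pairwise distance function; label -1 marks noise."""
    n = len(items)
    d = [[dist(items[i], items[j]) for j in range(n)] for i in range(n)]
    neighbors = [[j for j in range(n) if d[i][j] <= eps] for i in range(n)]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reachable from a core point is a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # only core points expand the cluster
    return labels

# Toy sequences standing in for ESTs from two unrelated transcripts.
ests = [
    "ACGTACGTACGTACGTACGT",
    "ACGTACGTACGTACGTACGA",
    "CCGTACGTACGTACGTACGT",
    "TTTTTGGGGGAAAAACCCCC",
    "TTTTTGGGGGAAAAACCCCA",
    "GTTTTGGGGGAAAAACCCCC",
]
labels = dbscan(ests, hybrid_distance, eps=0.3, min_pts=2)
print(labels)
```

In this sketch the global term reacts to overall composition while the local term rewards long exact matches, which is one plausible way for two kinds of evidence to complement each other. Real ESTs from the same gene family would be far more similar than these toy sequences, so the feature choices, `weight` and `eps` would need careful tuning.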

Highlights

Sequencing techniques have progressed rapidly in recent years, and the various types of sequence data produced are publicly available for research purposes
The correctness of our clustering result is evaluated against the Expressed Sequence Tag (EST) libraries from the genome browser, which are constructed from the alignment of ESTs on the human genome assembly [45]
The same datasets are tested with two alignment-free EST clustering tools and their clustering results are compared and discussed

Summary

Introduction

Sequencing techniques have progressed rapidly in recent years, and the various types of sequence data produced are publicly available for research purposes. Although many genome assemblies are now available, research on expressed sequence tags (ESTs) is still ongoing, because ESTs are a cost-effective resource for expression data analysis [1], [2], functional analysis [3], and the study of single-nucleotide polymorphisms [4]. One of the key steps in the EST processing pipeline is clustering. Although existing EST clustering methods work satisfactorily on the majority of EST datasets, they share a common weakness: they are prone to deliver unsatisfactory clustering results when dealing with ESTs from genes that belong to the same family. The root cause is that the distance measures they apply are not sensitive enough to separate these closely related genes.
