Abstract

Order-preserving submatrices (OPSMs) are an important unsupervised learning model that has been applied in many fields, such as DNA microarray data analysis, automatic recommendation systems, and target marketing systems. Unfortunately, most existing methods are heuristic algorithms that cannot reveal all OPSMs in this NP-complete problem. In particular, deep OPSMs, which correspond to long patterns with few supporting sequences, incur explosive computational costs and are completely pruned by most popular methods. In this paper, we propose an exact method to discover all OPSMs based on frequent sequential pattern mining. First, an existing algorithm is adapted to find all common subsequences (ACS) between every pair of row sequences, so that no deep OPSM is missed. Then, an improved prefix-tree data structure is used to store and traverse the ACS, and the Apriori principle is employed to mine the frequent sequential patterns efficiently. Finally, experiments were conducted on gene and synthetic datasets. The results demonstrate the effectiveness and efficiency of the method.
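To make the pipeline above concrete, the following is a minimal brute-force sketch, not the authors' algorithm: it omits the prefix tree and the Apriori pruning and enumerates candidates exhaustively, so it is only practical for tiny matrices. All function names (`row_to_sequence`, `common_subsequences`, `mine_opsms`) and the parameters `min_support` and `min_len` are hypothetical, and ties between equal values are assumed to be broken by column index.

```python
from itertools import combinations


def row_to_sequence(row):
    """Return the column indices of a row ordered by increasing value."""
    return tuple(sorted(range(len(row)), key=lambda j: row[j]))


def is_subsequence(pattern, sequence):
    """True if pattern appears in sequence in order (not necessarily contiguously)."""
    it = iter(sequence)
    return all(col in it for col in pattern)


def common_subsequences(a, b, min_len=2):
    """All common subsequences of a and b with at least min_len columns (brute force)."""
    found = set()
    for k in range(min_len, len(a) + 1):
        for idx in combinations(range(len(a)), k):
            candidate = tuple(a[i] for i in idx)
            if is_subsequence(candidate, b):
                found.add(candidate)
    return found


def mine_opsms(matrix, min_support=2, min_len=2):
    """Map each column pattern to the list of rows that preserve its order."""
    seqs = [row_to_sequence(row) for row in matrix]
    # Candidates come from every pair of rows, so long patterns shared by
    # only two rows (deep OPSMs) are still generated.
    candidates = set()
    for a, b in combinations(seqs, 2):
        candidates |= common_subsequences(a, b, min_len)
    result = {}
    for pattern in candidates:
        rows = [i for i, s in enumerate(seqs) if is_subsequence(pattern, s)]
        if len(rows) >= min_support:
            result[pattern] = rows
    return result


matrix = [
    [1, 2, 3, 4],
    [10, 20, 30, 5],
    [4, 3, 2, 1],
]
opsms = mine_opsms(matrix)
```

In this toy matrix, rows 0 and 1 both increase across columns 0, 1, 2, so the pattern `(0, 1, 2)` is supported by those two rows even though their magnitudes differ.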

Highlights

  • Recent high-throughput developments in DNA chips generate massive gene expression results, which are represented as a matrix D of real numbers, with rows representing the genes and columns representing different environmental conditions, different organs, or even different individuals

  • This paper focuses on pattern-based subspace clustering, known as the order-preserving submatrix (OPSM) model

  • The real data set was the yeast galactose data of [18, 19], a 205 × 80 microarray data set obtained from a study of gene response to the knockout of various genes in the galactose utilization (GAL) pathway of baker’s yeast, with columns corresponding to the knockout conditions and rows corresponding to genes that exhibit responses to the knockouts

Summary

Introduction

Recent high-throughput developments in DNA chips generate massive gene expression results, which are represented as a matrix D of real numbers, with rows (objects) representing the genes and columns (attributes) representing different environmental conditions, different organs, or even different individuals. To analyze gene expression data, clustering is widely used to group the objects into clusters based on similarity, so that the objects in the same cluster are as similar as possible. Genes in the same cluster may show similar cellular function or expression mode, implying that they are more likely to be involved in the same cellular process. Similarity measurements are mainly based on distance functions, including the Euclidean distance and the Manhattan distance. However, these distance functions are not appropriate for measuring object correlation in the gene matrix [1]. Moreover, only a small subset of genes participates in any cellular process of interest, and a cellular process occurs only in a subset of the samples; biclustering, or subspace clustering, is therefore required to capture clusters formed by a subset of genes across a subset of samples [2].
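The point about distance functions can be illustrated with a small sketch. The three expression profiles below are invented for illustration and are not taken from the paper's data; the helper names `euclidean` and `column_order` are likewise hypothetical. Two genes with the same rising trend but different magnitudes end up far apart under the Euclidean distance, while a gene with the opposite trend looks closer.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def column_order(row):
    """Column indices ordered by increasing expression value."""
    return tuple(sorted(range(len(row)), key=lambda j: row[j]))


g1 = [1.0, 2.0, 3.0, 4.0]      # rising trend
g2 = [10.0, 20.0, 30.0, 40.0]  # same rising trend, larger magnitude
g3 = [4.0, 3.0, 2.0, 1.0]      # opposite trend, similar magnitude

d_same_trend = euclidean(g1, g2)
d_opposite_trend = euclidean(g1, g3)
same_order = column_order(g1) == column_order(g2)
```

Here the distance ranks `g3` as the nearest neighbour of `g1`, although only `g2` preserves the column order of `g1`; comparing column orders instead of magnitudes is the idea behind the order-preserving model.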

