Indexing and Search of Order-Preserving Submatrix for Gene Expression Data

Bolin Chen, Guoyu Xu, Tao Jiang, Juntao Li

doi:10.1109/access.2019.2960856

Abstract

Bicluster pattern discovery plays a key role in the analysis of gene expression data. One vital model of bicluster mining is the Order-Preserving SubMatrix (OPSM), which captures similar tendencies of some genes under some conditions. Most OPSM discovery methods are batch mining techniques and are not suitable for low-latency data queries. To make data analysis efficient and effective, in this paper we first propose a prefix-tree based indexing method, pfTree, and then give an optimization technique, pIndex, that employs row and column header tables to search for positive, negative, and time-delayed OPSMs. We also present an online sharing query technique to accelerate frequent searches. Finally, we conduct extensive experiments and compare our methods with existing approaches. Experimental results demonstrate the efficiency and effectiveness of the proposed methods.

Highlights

Gene microarray technology makes it possible to monitor the expression levels of a huge number of genes across many experiments simultaneously
Order-Preserving SubMatrix (OPSM) queries: we explore multiple types of OPSM queries, including positive, negative, and time-delayed OPSM queries, based on pIndex with two header tables
General OPSM queries: based on the positive OPSM query method, we present a general query method for multiple types of OPSM search, Algorithm 10, which consists of the positive OPSM query, the negative OPSM query, and the time-delayed OPSM query
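
To make the positive and negative query types concrete, here is a minimal sketch. It is not the paper's Algorithm 10. It assumes a common reading of the definitions: a row supports a column sequence positively if its values strictly increase along that sequence, and negatively if they strictly decrease. The function names and the `kind` parameter are hypothetical, and the time-delayed variant is left out because this summary does not define it.

```python
from typing import List, Sequence


def supports_positive(row: Sequence[float], cols: Sequence[int]) -> bool:
    """True if the row's values strictly increase along the column sequence."""
    return all(row[a] < row[b] for a, b in zip(cols, cols[1:]))


def supports_negative(row: Sequence[float], cols: Sequence[int]) -> bool:
    """True if the row's values strictly decrease along the column sequence."""
    return all(row[a] > row[b] for a, b in zip(cols, cols[1:]))


def opsm_query(matrix: Sequence[Sequence[float]],
               cols: Sequence[int],
               kind: str = "positive") -> List[int]:
    """Return the ids of rows (genes) that follow the given column order.

    This is a brute-force scan over the matrix, used only to illustrate
    the query semantics; it does not use any index.
    """
    checks = {"positive": supports_positive, "negative": supports_negative}
    check = checks[kind]  # raises KeyError for unsupported query kinds
    return [i for i, row in enumerate(matrix) if check(row, cols)]
```

A scan like this touches every row on every query, which is the cost that an index such as pfTree or pIndex is meant to avoid.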

Summary

INTRODUCTION

Gene microarray technology makes it possible to monitor the expression levels of a huge number of genes across many experiments simultaneously. To improve query efficiency, two header tables are added to the pfTree, and the resulting structure is named pIndex. Both structures can index two kinds of data, i.e., gene expression data and OPSM data, and OPSMs can be queried on them directly, which eliminates the step of mining OPSMs from gene expression data. pIndex uses the row and column header tables to update the index and to query OPSMs. To further improve query performance, two pruning methods are proposed to reduce the traversal of useless branches. Because some queries, especially fuzzy queries, take more than one second, an online sharing query technique is proposed to reduce the cost of frequent and time-consuming searches. It applies the two indexes, pfTree and pIndex, to two kinds of datasets, i.e., gene expression and OPSM datasets.
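
The idea of indexing gene expression data in a prefix tree can be sketched as follows. This is an illustrative toy, not the authors' pfTree or pIndex. It assumes each row is stored as the sequence of column ids sorted by ascending value, with ties broken by column id. The class names and the single pruning rule are hypothetical, and the header tables and the paper's two pruning methods are not reproduced.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class Node:
    children: Dict[int, "Node"] = field(default_factory=dict)  # column id -> child
    rows: List[int] = field(default_factory=list)  # rows whose ordering ends here


class PrefixTreeIndex:
    """Toy prefix tree over the column orderings of each row."""

    def __init__(self, matrix: Sequence[Sequence[float]]) -> None:
        self.root = Node()
        for row_id, row in enumerate(matrix):
            # Columns sorted by ascending expression value (assumed encoding).
            order = sorted(range(len(row)), key=lambda c: row[c])
            node = self.root
            for col in order:
                node = node.children.setdefault(col, Node())
            node.rows.append(row_id)

    def query(self, cols: Sequence[int]) -> List[int]:
        """Rows in which the columns `cols` appear in this relative order."""
        hits: List[int] = []

        def walk(node: Node, matched: int) -> None:
            if not node.children:  # leaf: a complete ordering
                if matched == len(cols):
                    hits.extend(node.rows)
                return
            for col, child in node.children.items():
                if matched < len(cols) and col == cols[matched]:
                    walk(child, matched + 1)
                elif col in cols[matched + 1:]:
                    # Illustrative pruning: a later query column came too
                    # early, so nothing below this branch can match.
                    continue
                else:
                    walk(child, matched)

        walk(self.root, 0)
        return sorted(hits)
```

Rows that share a prefix of their ordering share a path in the tree, so a query discards a whole group of rows as soon as one branch is ruled out.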

PRELIMINARIES

OPSM QUERIES

ONLINE SHARING QUERIES

EXPERIMENTAL EVALUATION

Findings

CONCLUSION