RASMA: a reverse search algorithm for mining maximal frequent subgraphs

Saeed Salem,Mohammed Alokshiya,Mohammad Al Hasan

doi:10.1186/s13040-021-00250-1

Abstract

Background: Given a collection of coexpression networks over a set of genes, identifying subnetworks that appear frequently is an important research problem known as mining frequent subgraphs. Maximal frequent subgraphs are a representative set of frequent subgraphs; a frequent subgraph is maximal if it does not have a supergraph that is frequent. In bioinformatics, methods for mining frequent and/or maximal frequent subgraphs can be used to discover interesting network motifs that elucidate complex interactions among genes, reflected through the edges of the frequent subnetworks. Further study of frequent coexpression subnetworks enhances the discovery of biological modules and biological signatures for gene expression and disease classification.

Results: We propose a reverse search algorithm, called RASMA, for mining frequent and maximal frequent subgraphs in a given collection of graphs. A key innovation in RASMA is a connected subgraph enumerator that uses a reverse-search strategy to enumerate connected subgraphs of an undirected graph. Using this enumeration strategy, RASMA obtains all maximal frequent subgraphs very efficiently. To avoid the computationally prohibitive task of enumerating all frequent subgraphs while mining for the maximal frequent subgraphs, RASMA employs several pruning strategies that substantially improve its overall runtime performance. Experimental results show that on large gene coexpression networks, the proposed algorithm efficiently mines biologically relevant maximal frequent subgraphs.

Conclusion: Extracting recurrent gene coexpression subnetworks from multiple gene expression experiments enables the discovery of functional modules and subnetwork biomarkers. We have proposed a reverse search algorithm for mining maximal frequent subnetworks. Enrichment analysis of the extracted maximal frequent subnetworks reveals that frequent subnetworks are highly enriched with known biological ontologies.

Highlights

Given a collection of coexpression networks over a set of genes, identifying subnetworks that appear frequently is an important research problem known as mining frequent subgraphs
We tested the performance of RASMA on mining frequent and maximal frequent subgraphs from gene coexpression networks
Frequent coexpression subnetworks have been shown to be effective in functional annotation and subnetwork biomarker discovery

Introduction

Given a collection of coexpression networks over a set of genes, identifying subnetworks that appear frequently is an important research problem known as mining frequent subgraphs. Gene expression analysis on microarray data is used for discovering gene clusters that have similar expression profiles. Such analysis can identify dysregulated genes that serve as markers in various disease classification tasks. Given a gene expression dataset, a coexpression network is built in which the nodes represent genes and a link exists between a pair of genes if the corresponding genes exhibit significant correlation in the microarray analysis [2, 3]. Multiple gene expression datasets can be analyzed concurrently in a single study. Recent research has focused on mining biologically interesting gene coexpression subnetworks from multiple heterogeneous gene expression datasets.
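The network construction described above, and the notion of a subnetwork being frequent, can be illustrated with a minimal Python sketch. It is not the RASMA implementation: the function names, the toy expression values, and the choice of an absolute Pearson correlation cutoff of 0.9 as the test for "significant correlation" are all assumptions made for illustration. Each network is stored as a set of gene pairs over the shared gene set, and the support of a candidate subnetwork is the number of networks that contain all of its edges.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def coexpression_network(expr, threshold=0.9):
    """Build a network from `expr` (gene -> expression values) as a set of edges.

    Assumption: two genes are linked when |Pearson r| >= threshold.
    """
    return {
        frozenset((g, h))
        for g, h in combinations(sorted(expr), 2)
        if abs(pearson(expr[g], expr[h])) >= threshold
    }


def support(edges, networks):
    """Count the networks that contain every edge of the candidate subnetwork."""
    return sum(1 for net in networks if edges <= net)


# Two hypothetical expression datasets over the same three genes.
dataset_a = {"g1": [1, 2, 3, 4], "g2": [2, 4, 6, 8], "g3": [4, 1, 3, 2]}
dataset_b = {"g1": [5, 3, 1, 2], "g2": [10, 6, 2, 4], "g3": [9, 5, 1, 3]}

networks = [coexpression_network(d) for d in (dataset_a, dataset_b)]

single_edge = {frozenset(("g1", "g2"))}
two_edges = {frozenset(("g1", "g2")), frozenset(("g2", "g3"))}

support_single = support(single_edge, networks)  # edge present in both networks
support_two = support(two_edges, networks)       # both edges present only in the second
```

With a minimum support of 2, the single edge g1-g2 is frequent in this toy collection, while the two-edge subnetwork is not. Because no frequent supergraph of the single edge exists, it is also maximal here.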

Methods

Results

Conclusion