gMLC: a multi-label feature selection framework for graph classification

Xiangnan Kong, Philip S Yu

doi:10.1007/s10115-011-0407-3


Open Access



Abstract

Graph classification is of critical importance in a wide variety of applications, e.g. drug activity prediction and toxicology analysis. Current research on graph classification focuses on single-label settings. However, in many applications, each graph can be assigned a set of multiple labels simultaneously. Extracting good features using the multiple labels of the graphs becomes an important step before graph classification. In this paper, we study the problem of multi-label feature selection for graph classification and propose a novel solution, called gMLC, to efficiently search for optimal subgraph features for graph objects with multiple labels. Unlike existing feature selection methods in vector spaces, which assume the feature set is given, we perform multi-label feature selection for graph data progressively, together with the subgraph feature mining process. We derive an evaluation criterion to estimate the dependence between subgraph features and the multiple labels of graphs. Then, a branch-and-bound algorithm is proposed to efficiently search for optimal subgraph features by judiciously pruning the subgraph search space using multiple labels. Empirical studies demonstrate that our feature selection approach effectively boosts multi-label graph classification performance and is more efficient because it prunes the subgraph search space using multiple labels.

Highlights

Due to the recent advances of data collection technology, many application fields are facing various data with complex structures, e.g., chemical compounds, program flows and XML web documents
We focus on the subgraph-based graph classification problem, which assumes that a graph object $G_i$ is represented as a binary vector $x_i = [x_i^1, \cdots, x_i^m]^\top$ associated with a set of subgraph patterns $\{g_1, \cdots, g_m\}$, where $x_i^k = 1$ if $g_k$ is a subgraph of $G_i$ and $x_i^k = 0$ otherwise
We briefly review the general idea of the gSpan approach: instead of enumerating subgraphs and testing for isomorphism, gSpan first builds a lexicographic order over all the edges of a graph and maps each graph to a unique minimum depth-first search (DFS) code as its canonical label
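
The binary vector representation in the highlights above can be illustrated with a minimal Python sketch. It is not the gMLC or gSpan implementation: the helper names (`make_graph`, `contains`, `feature_vector`), the toy graphs, and the patterns are all hypothetical, edge labels are ignored, and subgraph containment is tested by brute force over node assignments, which is only practical for very small graphs.

```python
from itertools import permutations

# Assumed toy representation (not from the paper): a graph is a pair
# (node_labels, edges), where node_labels maps node id -> label and
# edges is a set of frozenset({u, v}) for undirected, unlabeled edges.
def make_graph(node_labels, edge_list):
    return dict(node_labels), {frozenset(e) for e in edge_list}

def contains(graph, pattern):
    """True if `pattern` occurs in `graph` as a (not necessarily induced) subgraph."""
    g_labels, g_edges = graph
    p_labels, p_edges = pattern
    p_nodes = list(p_labels)
    # Brute force: try every injective assignment of pattern nodes to graph nodes.
    for image in permutations(g_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        # Node labels must match under the assignment.
        if any(p_labels[u] != g_labels[mapping[u]] for u in p_nodes):
            continue
        # Every pattern edge must map onto an edge of the graph.
        if all(frozenset(mapping[u] for u in e) in g_edges for e in p_edges):
            return True
    return False

def feature_vector(graph, patterns):
    """Binary vector whose k-th entry is 1 iff the k-th pattern occurs in the graph."""
    return [1 if contains(graph, g) else 0 for g in patterns]

# Hypothetical subgraph patterns g_1, g_2, g_3: single edges C-C, C-O, C-N.
patterns = [
    make_graph({"a": "C", "b": "C"}, [("a", "b")]),
    make_graph({"a": "C", "b": "O"}, [("a", "b")]),
    make_graph({"a": "C", "b": "N"}, [("a", "b")]),
]

# Two toy graphs: the chain C-C-O and the single edge C-N.
G1 = make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])
G2 = make_graph({0: "C", 1: "N"}, [(0, 1)])

x1 = feature_vector(G1, patterns)
x2 = feature_vector(G2, patterns)
print(x1)  # [1, 1, 0]
print(x2)  # [0, 0, 1]
```

The brute-force check stands in for the isomorphism testing that gSpan is described as avoiding; it is used here only to make the meaning of each vector entry concrete.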

Summary

Introduction

Due to the recent advances of data collection technology, many application fields are facing various data with complex structures, e.g., chemical compounds, program flows and XML web documents.

Methods

Results

Conclusion