CBSSD: community-based semantic subgroup discovery

Blaž Škrlj, Jan Kralj, Nada Lavrač

doi:10.1007/s10844-019-00545-0

Abstract

Modern data mining algorithms frequently need to learn from heterogeneous data, including various sources of background knowledge. A data mining task in which ontologies serve as background knowledge is referred to as semantic data mining. A specific semantic data mining task is semantic subgroup discovery: a rule learning approach that allows ontology terms to be used in subgroup descriptions learned from class-labeled data. This paper presents Community-Based Semantic Subgroup Discovery (CBSSD), a novel approach that advances ontology-based subgroup identification by exploiting the structural properties of induced complex networks related to the studied phenomenon. Following the idea of multi-view learning, in which different sources of information are combined to obtain better models, the CBSSD approach can leverage different types of nodes of the induced complex network, simultaneously using information from multiple levels of a biological system. The approach was tested on ten data sets consisting of genes related to complex diseases, as well as core metabolic processes. The experimental results demonstrate that the CBSSD approach is scalable, is applicable to large complex networks, and can be used to identify significant combinations of terms that cannot be uncovered by contemporary term enrichment analysis approaches.

Highlights

Modern machine learning approaches are capable of using continuously increasing amounts of information to explain complex systems in numerous fields, including biology, sociology, mechanics and electrical engineering.
We evaluate the performance based on three weighted relative accuracy (WRAcc) measures introduced in Section 5.2, as well as the computational costs associated with different approaches.
The BioMine network appears to have had a noticeable effect on performance, as it serves as the background network for the top three approaches.
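
For readers unfamiliar with the WRAcc measure mentioned above, the following is a minimal sketch of the standard textbook definition, WRAcc(Cond → Class) = p(Cond) · (p(Class | Cond) − p(Class)). It is not the paper's implementation: the three WRAcc variants from Section 5.2 are not reproduced on this page, and the function name `wracc`, its parameters, and the toy counts are assumptions made purely for illustration.

```python
def wracc(n_total, n_class, n_cond, n_cond_and_class):
    """Standard weighted relative accuracy of a rule Cond -> Class.

    n_total          -- number of all examples (e.g. genes); hypothetical name
    n_class          -- examples belonging to the target class
    n_cond           -- examples covered by the rule condition
    n_cond_and_class -- covered examples that also belong to the target class
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    if n_cond == 0:
        # A rule that covers nothing carries no information.
        return 0.0
    coverage = n_cond / n_total              # p(Cond)
    precision = n_cond_and_class / n_cond    # p(Class | Cond)
    prior = n_class / n_total                # p(Class)
    return coverage * (precision - prior)


# Toy, made-up counts: 100 genes, 40 in the target class;
# a rule covers 20 genes, 15 of which are in the target class.
score = wracc(n_total=100, n_class=40, n_cond=20, n_cond_and_class=15)
print(f"WRAcc = {score:.3f}")
```

A positive score means the rule covers the target class more densely than the data set as a whole, while a score of zero means the rule is no better than the class prior. The coverage factor penalizes rules that are precise but cover very few examples.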

Summary

Introduction

Modern machine learning approaches are capable of using continuously increasing amounts of information to explain complex systems in numerous fields, including biology, sociology, mechanics and electrical engineering. As there can be many distinct types of data associated with a single system, novel approaches strive towards the integration of heterogeneous data and knowledge sources when learning predictive or descriptive models (Chen et al. 2014). In such settings, prior knowledge can play an important role in the development and deployment of learning algorithms in real-world scenarios. Bayesian methods can be leveraged to incorporate implicit knowledge about prior states of a system, i.e., prior distributions of the random variables being modeled. Such methods are in widespread use, e.g., in the field of phylogenetics, where Bayesian inference is used for the reconstruction of evolutionary trees (Drummond and Rambaut 2007). Machine learning research that relies on explicitly encoded background knowledge includes relational data mining (RDM) (Džeroski and Lavrač 2001) and inductive logic programming (ILP) (Muggleton 1991; Lavrač and Džeroski 1994), where the background knowledge is used along with the training examples to derive hypotheses in the form of logical rules that explain the positive examples.

Objectives

Results

Conclusion