Abstract

Background: The rapid growth of Next Generation Sequencing data demands new knowledge extraction methods. In particular, the RNA sequencing gene expression technique stands out for case-control studies on cancer. Such studies can be addressed with supervised machine learning techniques that extract human-interpretable models composed of genes and their relation to the investigated disease. State-of-the-art rule-based classifiers are designed to extract a single classification model, possibly composed of a few relevant genes. Conversely, we aim to create a large knowledge base composed of many rule-based models, and thus determine which genes could potentially be involved in the analyzed tumor. This comprehensive and open access knowledge base is required to disseminate novel insights about cancer.

Results: We propose CamurWeb, a new method and web-based software that extracts multiple equivalent classification models in the form of logic formulas (“if then” rules) and creates a knowledge base of these rules that can be queried and analyzed. The method is based on an iterative classification procedure and an adaptive feature elimination technique that enables the computation of many rule-based models related to the cancer under study. Additionally, CamurWeb includes a user-friendly interface for running the software, querying the results, and managing the performed experiments. Users can create a profile, upload their gene expression data, run the classification analyses, and interpret the results with predefined queries. To validate the software, we apply it to all publicly available RNA sequencing datasets from The Cancer Genome Atlas database, obtaining a large open access knowledge base about cancer. CamurWeb is available at http://bioinformatics.iasi.cnr.it/camurweb.

Conclusions: The experiments prove the validity of CamurWeb, yielding many classification models and thus several genes that are associated with 21 different cancer types. Finally, the comprehensive knowledge base about cancer and the software tool are released online; interested researchers have free access to them for further studies and for designing biological experiments in cancer research.
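The iterative classification with feature elimination described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the rule learner or the elimination policy, so everything below is an assumption made for illustration: a toy single-gene threshold learner stands in for the rule-based classifier, the genes used by each accepted rule are removed before the next iteration, and the names `Rule`, `learn_one_rule`, `extract_rule_base`, and the gene labels are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

Sample = Dict[str, float]  # gene name -> expression value


@dataclass
class Rule:
    gene: str         # gene tested by the rule
    direction: str    # ">" or "<=": "if gene <direction> threshold then tumor"
    threshold: float
    accuracy: float   # accuracy on the training samples


def learn_one_rule(samples: List[Sample], labels: List[bool],
                   genes: List[str]) -> Optional[Rule]:
    """Stand-in rule learner: best single-gene threshold rule (assumption)."""
    best: Optional[Rule] = None
    for gene in genes:
        values = sorted({s[gene] for s in samples})
        # Candidate thresholds are midpoints between consecutive values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            for direction in (">", "<="):
                predictions = [(s[gene] > threshold) == (direction == ">")
                               for s in samples]
                hits = sum(p == y for p, y in zip(predictions, labels))
                accuracy = hits / len(labels)
                if best is None or accuracy > best.accuracy:
                    best = Rule(gene, direction, threshold, accuracy)
    return best


def extract_rule_base(samples: List[Sample], labels: List[bool],
                      min_accuracy: float = 0.9,
                      max_iterations: int = 10) -> List[Rule]:
    """Repeat: learn a rule, store it, eliminate its gene, learn again."""
    remaining = sorted(samples[0].keys())
    rule_base: List[Rule] = []
    for _ in range(max_iterations):
        if not remaining:
            break
        rule = learn_one_rule(samples, labels, remaining)
        # Stop once the remaining genes no longer give a reliable model.
        if rule is None or rule.accuracy < min_accuracy:
            break
        rule_base.append(rule)
        remaining.remove(rule.gene)  # feature elimination step
    return rule_base


# Toy data: True marks a tumor sample, False a control sample.
samples = [
    {"GENE_A": 1.0, "GENE_B": 5.0, "GENE_C": 3.0},
    {"GENE_A": 2.0, "GENE_B": 6.0, "GENE_C": 1.0},
    {"GENE_A": 8.0, "GENE_B": 1.0, "GENE_C": 2.0},
    {"GENE_A": 9.0, "GENE_B": 2.0, "GENE_C": 4.0},
]
labels = [False, False, True, True]

rule_base = extract_rule_base(samples, labels)
for rule in rule_base:
    print(f"if {rule.gene} {rule.direction} {rule.threshold} then tumor "
          f"(accuracy {rule.accuracy:.2f})")
```

On this toy data the loop yields two alternative rules, one per discriminating gene, and stops when the last remaining gene cannot reach the accuracy threshold. This is the sense in which repeated elimination produces several equivalent models rather than one.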

Highlights

  • The high growth of Next Generation Sequencing data currently demands new knowledge extraction methods

  • We consider Ribonucleic acid (RNA)-seq Next Generation Sequencing (NGS) experiments on tumor samples extracted from the Genomic Data Commons (GDC) [11], a web portal dedicated to cancer care and prevention, which is an evolution of The Cancer Genome Atlas (TCGA) [12]

  • In order to prove the validity of CamurWeb, we performed a classification analysis on all publicly available RNA sequencing datasets of The Cancer Genome Atlas database extracted from the Genomic Data Commons portal


Summary

Introduction

The high growth of Next Generation Sequencing data currently demands new knowledge extraction methods. The GDC portal publicly provides datasets from the following genomic experiments on more than 40 tumor types: DNA sequencing, Copy Number Variation, Somatic Mutations, DNA Methylation, Gene Expression Quantification, and miRNA Expression Quantification. These datasets are retrievable with: (i) the GDC Data Portal [14], a web portal that allows browsing, retrieving, and downloading genomic and clinical data; (ii) the GDC Data Transfer Tool [15], a standard client-based software for high-performance batch access; (iii) the GDC Application Programming Interface (API) [16], which allows programmatic or command-line access for searching and downloading subsets of data files based on specific parameters. In order to fully exploit this big data repository, new methods for extracting knowledge are required [7].
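As an illustration of the programmatic access route, the sketch below builds a search request for gene expression files of one project through the GDC API. The endpoint URL, the filter fields (`cases.project.project_id`, `data_type`), the returned fields, and the response layout reflect the public GDC API as commonly documented, but they are assumptions here and should be checked against the current GDC documentation. The helper names `build_file_query` and `search_files` and the project identifier `TCGA-BRCA` are chosen only for this example.

```python
import json
from typing import Dict, List
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed public endpoint for file metadata searches.
GDC_FILES_ENDPOINT = "https://api.gdc.cancer.gov/files"


def build_file_query(project_id: str,
                     data_type: str = "Gene Expression Quantification",
                     size: int = 10) -> Dict[str, str]:
    """Build the query parameters for a file search (no network access)."""
    filters = {
        "op": "and",
        "content": [
            {"op": "in",
             "content": {"field": "cases.project.project_id",
                         "value": [project_id]}},
            {"op": "in",
             "content": {"field": "data_type", "value": [data_type]}},
        ],
    }
    return {
        "filters": json.dumps(filters),  # filters travel as a JSON string
        "fields": "file_id,file_name,cases.submitter_id",
        "format": "JSON",
        "size": str(size),
    }


def search_files(project_id: str) -> List[dict]:
    """Send the query and return the matching file records.

    Requires network access; the response layout (data -> hits) is assumed.
    """
    url = GDC_FILES_ENDPOINT + "?" + urlencode(build_file_query(project_id))
    with urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return payload["data"]["hits"]


# Building the parameters works offline; search_files("TCGA-BRCA") would
# perform the actual request.
params = build_file_query("TCGA-BRCA")
print(params["filters"])
```

Keeping the query construction separate from the network call makes the filter logic easy to inspect and test before any data is downloaded.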

