Abstract

Background: Gene expression provides a means for an organism to produce the gene products it needs to live. Variation in significant gene expression levels can distinguish both the gene and the tissue in which the gene is expressed. Tissue-specific gene expression, often determined by single nucleotide polymorphisms (SNPs), provides potential molecular markers or therapeutic targets for disease progression. Therefore, SNPs are good candidates for identifying disease progression. The current bioinformatics literature uses gene network modeling to summarize complex interactions between transcription factors, genes, and gene products. Here, our focus is on the SNPs’ impact on tissue-specific gene expression levels. To the best of our knowledge, no studies distinguish tissue-specific genes using SNP expression levels.

Method: We propose a novel feature extraction method based on highly expressed SNPs using k-mers as features. We also propose optimal k-mer and feature sizes for our approach. Determining the optimal sizes is still an open research question, as it depends on the dataset and the purpose of the analysis. Therefore, we evaluate our algorithm’s performance on a range of k-mer and feature sizes using a multinomial naive Bayes (MNB) classifier on genes in the 49 human tissues from the Genotype-Tissue Expression (GTEx) portal.

Conclusions: Our approach achieves practical performance results with k-mers of size 3. Depending on the purpose of the analysis and the number of tissue-specific genes under study, feature sizes [7, 8, 9] and [8, 9, 10] are typically optimal for the machine learning model.
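The general idea in the Method description (count k-mers, then classify with multinomial naive Bayes) can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not say how SNP expression levels are encoded before k-mers are extracted, so the toy character strings, the labels `tissue_A` and `tissue_B`, and the helper names `kmer_counts` and `TinyMultinomialNB` are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def kmer_counts(seq, k=3):
    """Count overlapping k-mers (substrings of length k) in a string."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


class TinyMultinomialNB:
    """Small multinomial naive Bayes over k-mer count dictionaries.

    Hypothetical helper for illustration; uses Laplace smoothing (alpha).
    """

    def __init__(self, alpha=1.0):
        self.alpha = alpha

    def fit(self, count_dicts, labels):
        # Vocabulary: every k-mer seen in training.
        self.vocab = {kmer for counts in count_dicts for kmer in counts}
        # Per-class k-mer totals and per-class sample counts.
        self.class_totals = defaultdict(Counter)
        self.class_docs = Counter(labels)
        for counts, label in zip(count_dicts, labels):
            self.class_totals[label].update(counts)
        self.n_docs = len(labels)
        return self

    def _log_score(self, counts, label):
        totals = self.class_totals[label]
        denom = sum(totals.values()) + self.alpha * len(self.vocab)
        score = math.log(self.class_docs[label] / self.n_docs)  # log prior
        for kmer, n in counts.items():
            if kmer in self.vocab:  # ignore k-mers never seen in training
                score += n * math.log((totals[kmer] + self.alpha) / denom)
        return score

    def predict(self, counts):
        return max(self.class_docs, key=lambda lab: self._log_score(counts, lab))


# Toy training data: made-up strings, NOT real GTEx or SNP data.
train_seqs = ["ACGACGACG", "ACGACGTTT", "TTGTTGTTG", "TTGTTGACC"]
train_labels = ["tissue_A", "tissue_A", "tissue_B", "tissue_B"]

model = TinyMultinomialNB().fit(
    [kmer_counts(s, k=3) for s in train_seqs], train_labels
)
print(model.predict(kmer_counts("ACGACGACGA", k=3)))
```

Varying `k` in `kmer_counts` and re-evaluating the classifier is the kind of sweep the abstract describes; the paper's notion of "feature size" is not defined in this excerpt and is not modeled here.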

Highlights

  • Gene expression provides a means for an organism to produce gene products necessary for the organism to live

  • In contrast to existing machine learning approaches that use each single nucleotide polymorphism (SNP) as a feature for single disease prediction, we focus on tissue-specific gene expression across all 49 human tissues available in the Genotype-Tissue Expression (GTEx) portal

  • We show that patterns learned from SNP expression levels with the highest and lowest p-values contain similar discriminatory power

Summary

Introduction

Gene expression provides a means for an organism to produce the gene products it needs to live. Tissue-specific gene expression, often determined by single nucleotide polymorphisms (SNPs), provides potential molecular markers or therapeutic targets for disease progression. To the best of our knowledge, no studies distinguish tissue-specific genes using SNP expression levels. Most existing work has focused on developing disease prediction models based on SNPs associated with a single disease only, for example, breast cancer [9], inflammatory bowel disease [10], and obesity [11]. These studies use each SNP as a feature. Reference [12] is a recent survey that provides a detailed review of the different supervised machine learning algorithms for disease prediction.

