Abstract

Background

Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. However, these algorithms require a user to select a biologically appropriate latent space dimensionality. In practice, most researchers fit a single algorithm and latent dimensionality. We sought to determine the extent to which selecting only one fit limits the biological features captured in the latent representations and, consequently, limits what can be discovered with subsequent analyses.

Results

We compress gene expression data from three large datasets consisting of adult normal tissue, adult cancer tissue, and pediatric cancer tissue. We train many different models across a large range of latent space dimensionalities and observe various performance differences. We identify more curated pathway gene sets significantly associated with individual dimensions in denoising autoencoder and variational autoencoder models trained using an intermediate number of latent dimensionalities. Combining compressed features across algorithms and dimensionalities captures the most pathway-associated representations. When trained with different latent dimensionalities, models learn strongly associated and generalizable biological representations, including sex, neuroblastoma MYCN amplification, and cell types. Stronger signals, such as tumor type, are best captured in models trained at lower dimensionalities, while more subtle signals, such as pathway activity, are best identified in models trained with more latent dimensionalities.

Conclusions

There is no single best latent dimensionality or compression algorithm for analyzing gene expression data. Instead, using features derived from different compression models across multiple latent space dimensionalities enhances biological representations.

Highlights

  • Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation

  • We used real and permuted data and initialized each model five times per latent dimensionality resulting in a total of 4200 different compression models (Additional file 1: Figure S1)

  • Although there were some potentially mischaracterized samples, feature 12 in the variational autoencoder (VAE) k = 50 model robustly separated MYCN amplification status in NBL tumors (t = −18.5, p = 6.6 × 10⁻³⁸) (Fig. 3f). This feature also distinguished MYCN amplification status in NBL cell lines [29] that were not used to train the compression model or to select the feature (t = −3.2, p = 4.2 × 10⁻³) (Fig. 3g). These analyses demonstrate that different compression models best identify specific biological representations when trained with different latent space dimensionalities
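The highlight above rests on a two-sample t-test asking whether a single latent feature separates two sample groups. A minimal sketch of that kind of test, using synthetic feature values and placeholder group labels rather than the paper's actual VAE features:

```python
# Hedged sketch: does one latent feature separate two sample groups
# (e.g., MYCN-amplified vs. non-amplified NBL tumors)?
# All values below are synthetic placeholders, not the paper's data.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)
# Hypothetical per-sample values of a single compressed feature
amplified = rng.normal(loc=-2.0, scale=0.5, size=30)
non_amplified = rng.normal(loc=2.0, scale=0.5, size=70)

# Two-sample t-test: a large |t| and small p indicate the feature
# stratifies the groups, as reported for VAE k = 50, feature 12
t_stat, p_value = ttest_ind(amplified, non_amplified)
print(t_stat < 0, p_value < 1e-10)
```

With well-separated group means, the statistic is large in magnitude and the p-value vanishingly small, mirroring the direction of the reported result.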

Introduction

Unsupervised compression algorithms applied to gene expression data extract latent or hidden signals representing technical and biological sources of variation. Linear methods such as principal component analysis (PCA), independent component analysis (ICA), and non-negative matrix factorization (NMF) have been applied to large transcriptomic compendia to reveal the influence of copy number alterations on gene expression measurements, to identify coordinated transcriptional programs, and to estimate cell-type proportions in bulk tissue samples [1,2,3,4,5]. Nonlinear methods such as denoising autoencoders (DAE) and variational autoencoders (VAE) have revealed latent signals characterizing oxygen exposure, transcription factor targets, cancer subtypes, and drug response [6,7,8,9]. We focus on using compression algorithms to identify biological representations by analyzing processed data with batch effects already mitigated.
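The approach of fitting linear compression models across a sweep of latent dimensionalities and then combining their features can be sketched as follows. The matrix, the dimensionality grid, and the solver settings here are illustrative assumptions, not the paper's actual configuration:

```python
# Minimal sketch: compress a samples-by-genes expression matrix with
# linear methods (PCA, NMF) across several latent dimensionalities,
# then concatenate the compressed features across algorithms and
# dimensionalities. Synthetic non-negative data stand in for a real
# processed expression matrix.
import numpy as np
from sklearn.decomposition import NMF, PCA

rng = np.random.default_rng(0)
X = rng.random((100, 500))  # 100 samples x 500 genes (non-negative for NMF)

latent_dims = [2, 5, 10, 25]  # hypothetical sweep of dimensionalities
embeddings = {}
for k in latent_dims:
    embeddings[("pca", k)] = PCA(n_components=k).fit_transform(X)
    embeddings[("nmf", k)] = NMF(
        n_components=k, init="nndsvda", max_iter=500
    ).fit_transform(X)

# Combined representation: one wide feature matrix per sample
combined = np.hstack([embeddings[key] for key in sorted(embeddings)])
print(combined.shape)  # (100, 84): 2 algorithms x (2 + 5 + 10 + 25) features
```

Downstream analyses (pathway association tests, classifiers for sample attributes) would then run over columns of `combined` rather than over a single model's features.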

