Abstract

Gene expression in individual cells can now be measured for thousands of cells in a single experiment thanks to innovative sample-preparation and sequencing technologies. State-of-the-art computational pipelines for single-cell RNA-sequencing data, however, still employ computational methods that were developed for traditional bulk RNA-sequencing data, thus not accounting for the peculiarities of single-cell data, such as sparseness and zero-inflated counts. Here, we present a ready-to-use pipeline named gf-icf (gene frequency–inverse cell frequency) for normalization of raw counts, feature selection, and dimensionality reduction of scRNA-seq data for their visualization and subsequent analyses. Our work is based on a data transformation model named term frequency–inverse document frequency (TF-IDF), which has been extensively used in the field of text mining where extremely sparse and zero-inflated data are common. Using benchmark scRNA-seq datasets, we show that the gf-icf pipeline outperforms existing state-of-the-art methods in terms of improved visualization and ability to separate and distinguish different cell types.

Highlights

  • Until very recently, the cost, time, and equipment needed to perform single-cell transcriptomics have limited their application to a few selected studies

  • We aimed at developing a computational tool that could integrate single-cell transcriptional profiles across multiple conditions by extracting relevant genes to improve data visualization and cell type identification

  • The intuition behind applying the term frequency–inverse document frequency (TF-IDF) approach to scRNA-seq data is that if a gene is highly expressed in a cell, it should be scored higher than less expressed genes in the same cell, but at the same time, highly expressed genes common to many cells of different types should be scored lower than genes expressed in a specific subpopulation of cells
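The scoring intuition in the last highlight can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `gf_icf`, the toy count matrix, and the smoothed logarithmic weighting are assumptions made for illustration, since this excerpt does not give the exact weighting used by the gf-icf pipeline.

```python
import numpy as np


def gf_icf(counts):
    """Toy TF-IDF-style weighting of a cells x genes count matrix.

    Hypothetical helper for illustration; the smoothing below is an
    assumed variant, not necessarily the one used by the gf-icf pipeline.
    """
    counts = np.asarray(counts, dtype=float)
    n_cells = counts.shape[0]

    # Gene frequency: each gene's share of the total counts in its cell.
    totals = counts.sum(axis=1, keepdims=True)
    gf = np.divide(counts, totals, out=np.zeros_like(counts), where=totals > 0)

    # Inverse cell frequency: genes detected in fewer cells get a larger weight.
    cells_with_gene = (counts > 0).sum(axis=0)
    icf = np.log((n_cells + 1) / (cells_with_gene + 1))

    return gf * icf


# Rows are cells, columns are genes (made-up numbers).
# Gene 0 is highly expressed everywhere; genes 1 and 2 are each cell-specific.
counts = np.array([
    [10, 5, 0],
    [10, 0, 0],
    [10, 0, 5],
    [10, 0, 0],
])
scores = gf_icf(counts)
```

In this toy matrix, gene 0 has the highest raw count in every cell but is detected in all cells, so its weight collapses to zero, while the cell-specific genes 1 and 2 keep a positive score in the cells that express them.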



Introduction

Until very recently, the cost, time, and equipment needed to perform single-cell transcriptomics have limited their application to a few selected studies. State-of-the-art computational pipelines for scRNA-seq data visualization consist of four main steps (Trapnell et al, 2014; Klein et al, 2015; Macosko et al, 2015; Shekhar et al, 2016; Zheng et al, 2017; Butler et al, 2018): i) normalization of raw counts scaled by sample-specific size factors; ii) feature selection by identifying the most variable genes across cells; iii) dimensionality reduction with principal component analysis (PCA); and iv) projection of scRNA-seq data into an embedded space [for example with t-SNE or UMAP (van der Maaten and Hinton, 2008; McInnes and Healy, 2018)]. Most steps in these pipelines still employ computational methods that were developed for traditional bulk RNA-seq data and do not account for the high level of noise caused by dropouts, which leads to an excess of zeros and near-zero counts in the dataset.
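The first three steps of such a conventional pipeline can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not any cited tool's implementation: the function name `conventional_pipeline`, the library-size-based size factors, the log transform, and plain variance ranking for gene selection are all choices made here for brevity. Step iv (t-SNE or UMAP) is left out because it requires an external library.

```python
import numpy as np


def conventional_pipeline(counts, n_top_genes=10, n_components=2):
    """Hypothetical sketch of steps i-iii on a cells x genes count matrix."""
    counts = np.asarray(counts, dtype=float)

    # i) Normalization: scale each cell by a size factor derived from its
    #    library size (assumed definition), then log-transform.
    lib_size = counts.sum(axis=1, keepdims=True)
    size_factors = lib_size / lib_size.mean()
    norm = np.log1p(counts / size_factors)

    # ii) Feature selection: keep the genes with the highest variance across cells.
    variances = norm.var(axis=0)
    top_genes = np.argsort(variances)[::-1][:n_top_genes]
    selected = norm[:, top_genes]

    # iii) Dimensionality reduction: PCA via SVD of the centered matrix.
    centered = selected - selected.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    pcs = u[:, :n_components] * s[:n_components]

    return top_genes, pcs


# Simulated counts (made-up data): 50 cells x 20 genes.
rng = np.random.default_rng(0)
counts = rng.poisson(2.0, size=(50, 20))
top_genes, pcs = conventional_pipeline(counts, n_top_genes=10, n_components=2)
```

The resulting `pcs` array holds one low-dimensional coordinate vector per cell, which would normally be passed to t-SNE or UMAP for the final embedding. Nothing in this sketch treats zeros specially, which is the limitation the text attributes to methods inherited from bulk RNA-seq.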

