A Simple Guideline to Assess the Characteristics of RNA-Seq Data.

Keunhong Son, Keunsoo Kang, Kyudong Han, Sungryul Yu, Wonseok Shin

doi:10.1155/2018/2906292

Open Access

https://doi.org/10.1155/2018/2906292

Abstract

Next-generation sequencing (NGS) techniques have been used to generate various molecular maps, including genomes, epigenomes, and transcriptomes. Transcriptomes from a given cell population can be profiled via RNA-seq. However, there is no simple way to systematically assess the characteristics of RNA-seq data. In this study, we provide a simple method for intuitively evaluating RNA-seq data using two different principal component analysis (PCA) plots. The gene expression PCA plot provides insights into the association between samples, while the transcript integrity number (TIN) score plot provides a quality map of the given RNA-seq data. With this approach, we found that RNA-seq datasets deposited in public repositories often contain a few low-quality samples that can lead to misinterpretations. The effect of sampling errors on differentially expressed gene (DEG) analysis was evaluated with ten RNA-seq samples from invasive ductal carcinoma tissues and three RNA-seq samples from adjacent normal tissues taken from a Korean breast cancer patient. The evaluation demonstrated that sampling errors, in which the selected samples do not represent a given population, can lead to different interpretations of the DEG analysis. Therefore, the proposed approach can be used to avoid sampling errors prior to RNA-seq data analysis.

Highlights

Recent advances in DNA sequencing technology, led by next-generation sequencing (NGS), have generated various molecular maps, including genomes, transcriptomes, and epigenomes [1,2,3].
The C0 sample was located far from the other cancer samples in the gene expression principal component analysis (PCA) plot, which suggested that this sample came from a region spatially distinct from the other cancer samples.
The N3 sample was located close to the cancer cluster in the gene expression PCA plot, even though it came from adjacent normal tissue.

Summary

Introduction

Recent advances in DNA sequencing technology, led by next-generation sequencing (NGS), have generated various molecular maps, including genomes, transcriptomes, and epigenomes [1,2,3]. NGS-based transcriptomic data, called RNA-seq (or expression profiling by high throughput sequencing), are one of the most abundant data types according to statistics from the Gene Expression Omnibus (GEO) database (https://www.ncbi.nlm.nih.gov/geo/summary/?type=series) [4]. Because of the high cost of RNA-seq or the difficulty of obtaining samples, generally no more than two or three biological replicates are used as representative samples for the population of a given condition (e.g., a disease). As a result, each individual RNA-seq dataset greatly affects the outcome. The gene expression PCA plot provides a map of the distances
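To make the idea of a gene expression PCA plot concrete, the following is a minimal sketch in Python, not the procedure used in this study. It assumes a samples-by-genes matrix of expression values, a log2(x + 1) transform, and PCA computed by singular value decomposition with NumPy. The function names `expression_pca` and `distance_to_centroid` are hypothetical, and the data are synthetic.

```python
import numpy as np


def expression_pca(counts, n_components=2):
    """Project samples onto the leading principal components.

    counts: array of shape (n_samples, n_genes) with non-negative values.
    Returns (scores, explained_variance_ratio).
    """
    # Assumption: log2(x + 1) damps the influence of highly expressed genes.
    x = np.log2(np.asarray(counts, dtype=float) + 1.0)
    # Center each gene across samples before the decomposition.
    x = x - x.mean(axis=0)
    u, s, _ = np.linalg.svd(x, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    variance = s ** 2
    return scores, variance[:n_components] / variance.sum()


def distance_to_centroid(scores, group_idx):
    """Euclidean distance of every sample to the centroid of one group."""
    centroid = scores[group_idx].mean(axis=0)
    return np.linalg.norm(scores - centroid, axis=1)


# Synthetic toy data (not from the paper): five similar samples and one
# sample whose expression is rescaled gene by gene.
rng = np.random.default_rng(0)
base = rng.poisson(50, size=200)
similar = np.vstack([rng.poisson(base) for _ in range(5)])
shifted = rng.poisson(base * rng.uniform(0.2, 5.0, size=200))
counts = np.vstack([similar, shifted])

scores, ratio = expression_pca(counts)
dist = distance_to_centroid(scores, np.arange(5))
print(scores.round(2))
print(dist.round(2))
```

In this sketch, a sample that lies far from the centroid of its own group in the PC1/PC2 plane is a candidate for closer inspection before DEG analysis. The distance threshold to apply is left open here because it depends on the dataset.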

Methods

Results

Conclusion