Validation of Splicing Events in Transcriptome Sequencing Data.

Wolfgang Kaisers, Johannes Ptok, Holger Schwender, Heiner Schaal

doi:10.3390/ijms18061110

Abstract

Genomic alignments of sequenced cellular messenger RNA contain gapped alignments, which are interpreted as a consequence of intron removal. The resulting gap-sites (genomic locations of alignment gaps) are landmarks representing potential splice-sites. As alignment algorithms report gap-sites with a considerable false discovery rate, validation is required. We describe two quality scores, the gap quality score (gqs) and the weighted gap information score (wgis), developed for validation of putative splicing events: gqs relies solely on alignment data, whereas wgis additionally considers information from the genomic sequence. FASTQ files obtained from 54 human dermal fibroblast samples were aligned against the human genome (GRCh38) using the TopHat and STAR aligners. Statistical properties of gap-sites validated by gqs and wgis were evaluated by their sequence similarity to known exon-intron borders. Within the 54 samples, TopHat identifies 1,000,380 gap-sites and STAR reports 6,487,577 gap-sites. Due to the lack of strand information, however, the percentage of identified GT-AG gap-sites is rather low: in merged data from the 54 fibroblast samples, gap-sites from TopHat contain ≈89% GT-AG dinucleotide pairs, while gap-sites from STAR contain only ≈42%. Validation with gqs yields 156,251 gap-sites from TopHat alignments and 166,294 from STAR alignments. Validation with wgis yields 770,327 gap-sites from TopHat alignments and 1,065,596 from STAR alignments. Both alignment algorithms, TopHat and STAR, report gap-sites with a considerable false discovery rate, which can be reduced drastically by validation with gqs and wgis.
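The remark about strand information can be illustrated with a small sketch. It is not the authors' procedure; it only shows why an intron on the minus strand appears as CT-AC rather than GT-AG when read from the reference (plus) strand. The function names and the coordinate convention (1-based positions of the last aligned base before the gap and the first aligned base after it) are assumptions made for this example.

```python
def intron_dinucleotides(chrom_seq, left, right):
    """Return the first and last two intronic bases on the plus strand.

    Assumed convention: left is the 1-based position of the last aligned
    base before the gap, right the 1-based position of the first aligned
    base after it, so the intron spans positions left+1 .. right-1.
    """
    intron = chrom_seq[left:right - 1]  # convert to a 0-based slice
    return intron[:2], intron[-2:]


def classify_gap_site(chrom_seq, left, right):
    """Label a gap-site by its intronic dinucleotide pair."""
    pair = intron_dinucleotides(chrom_seq, left, right)
    if pair == ("GT", "AG"):
        return "GT-AG (plus strand)"
    if pair == ("CT", "AC"):
        # Reverse complement of GT...AG: a GT-AG intron on the minus strand.
        return "GT-AG (minus strand)"
    return "other"


# Toy sequences: 5 exonic bases, an 8-base "intron", 5 exonic bases.
plus_example = "AAAAA" + "GTCCCCAG" + "TTTTT"
minus_example = "AAAAA" + "CTGGGGAC" + "TTTTT"
other_example = "AAAAA" + "GGCCCCAA" + "TTTTT"

labels = [classify_gap_site(s, 5, 14)
          for s in (plus_example, minus_example, other_example)]
```

A check that only looks for GT-AG on the plus strand would count the second toy site as non-canonical, which is one way a missing strand assignment can lower the observed GT-AG percentage.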

Highlights

Analysis of transcriptome sequencing data focuses on differential expression of genes, as well as alternative splicing
Transcript reconstruction algorithms suffer from inaccuracies [1,2] and even the identification of splice events is associated with a high false discovery rate (FDR) [3]
1,083,629 gap-sites were validated by wgis in either TopHat or STAR alignments

Summary

Introduction

Analysis of transcriptome sequencing data focuses on differential expression of genes, as well as alternative splicing. Genomic alignments of transcriptome sequencing data contain alignment gaps. Alignment gaps are landmarks indicating potential splice-sites. Low specificity of reported alignment gaps may seriously compromise the validity of analysis results. Transcript reconstruction algorithms suffer from inaccuracies [1,2] and even the (much simpler) identification of splice events is associated with a high false discovery rate (FDR) [3]. Two approaches for validation of splicing events in transcriptome sequencing data are described and evaluated: an approach relying solely on alignment data (gqs) and a second approach which includes information from the genomic sequence (wgis).
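To make the notion of a gap-site concrete, the following minimal sketch collects gap-sites from alignment records. It assumes alignments are available as (chromosome, 1-based start position, CIGAR string) tuples, where the CIGAR `N` operation marks a skipped reference region as defined in the SAM format. The function names and the toy records are hypothetical, and the sketch does not compute gqs or wgis.

```python
import re
from collections import Counter

# CIGAR operations that advance the reference position (SAM format).
REF_CONSUMING = set("MDN=X")
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def gap_sites(chrom, pos, cigar):
    """Yield (chrom, left, right) for every N operation of one alignment.

    pos is the 1-based leftmost reference position of the alignment;
    left is the last aligned position before the gap and right the
    first aligned position after it (both 1-based).
    """
    ref = pos
    for length, op in CIGAR_RE.findall(cigar):
        length = int(length)
        if op == "N":
            yield (chrom, ref - 1, ref + length)
        if op in REF_CONSUMING:
            ref += length


def count_gap_sites(alignments):
    """Count how many gapped alignments support each gap-site."""
    counts = Counter()
    for chrom, pos, cigar in alignments:
        counts.update(gap_sites(chrom, pos, cigar))
    return counts


# Toy records: two reads spanning the same gap, one ungapped read.
alignments = [
    ("chr1", 100, "50M200N50M"),
    ("chr1", 120, "30M200N70M"),
    ("chr1", 500, "100M"),
]
sites = count_gap_sites(alignments)
```

Here both gapped reads map to the same gap-site `("chr1", 149, 350)`, so its count is 2, while the ungapped read contributes nothing.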

Genomic Alignments and Gap-Sites

Intronic Genome Sequence at Gap-Sites

General Considerations on Quality Scores

Sequence Similarity between Gap-Sites and Splice-Sites

Strand Information in wgis

Annotation of Gap-Sites

Alignments from TopHat Aligner

Alignments from STAR Aligner

Comparison of Alignment Numbers

Distribution of gqs

Number of gqs-Validated Gap-Sites

Distribution of IDIN on gqs-Validated Gap-Sites

Validation of wgis

Definition of GQL Limits

Literature

Sequence Logos of Validated Gap-Sites

Global Statistics

Relation between Score Value and Gap-Site Multiplicity

Dependence of Gap-Site Validation on Gap-Site Coverage

Relation between gqs and wgis Validation

Unvalidated Gap-Sites

Maximal Alignment Coverage on Unvalidated Gap-Sites

Performance of Quality Scores

Sensitivity and FDR

Distribution of Gapped Alignments

Validation Strategies

Limitations

Fibroblast Samples

Software

Statistical Evaluation

Conclusions

Journal: International Journal of Molecular Sciences
Publication Date: May 23, 2017
Citations: 6
License type: CC BY 4.0
