Abstract

Genomic alignments of sequenced cellular messenger RNA contain gapped alignments, which are interpreted as a consequence of intron removal. The resulting gap-sites, the genomic locations of alignment gaps, are landmarks representing potential splice-sites. Because alignment algorithms report gap-sites with a considerable false discovery rate, validation is required. We describe two quality scores, the gap quality score (gqs) and the weighted gap information score (wgis), developed for validation of putative splicing events: while gqs relies solely on alignment data, wgis additionally considers information from the genomic sequence. FASTQ files obtained from 54 human dermal fibroblast samples were aligned against the human genome (GRCh38) using the TopHat and STAR aligners. Statistical properties of gap-sites validated by gqs and wgis were evaluated by their sequence similarity to known exon-intron borders. Within the 54 samples, TopHat identifies 1,000,380 gap-sites and STAR reports 6,487,577. Due to the lack of strand information, however, the percentage of identified GT-AG gap-sites is rather low: in merged data from the 54 fibroblast samples, gap-sites from TopHat contain ≈89% GT-AG dinucleotide pairs, whereas gap-sites from STAR contain only ≈42%. Validation with gqs yields 156,251 gap-sites from TopHat alignments and 166,294 from STAR alignments. Validation with wgis yields 770,327 gap-sites from TopHat alignments and 1,065,596 from STAR alignments. Both alignment algorithms, TopHat and STAR, report gap-sites with a considerable false discovery rate, which can be reduced drastically by validation with gqs and wgis.
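The effect of missing strand information on the GT-AG percentage can be illustrated with a small sketch. An intron transcribed from the minus strand appears on the plus-strand reference as CT...AC (the reverse complement of GT...AG), so a check that looks only for GT and AG on the plus strand misses it. The function name `intron_motif` and its labels are hypothetical and not taken from the paper; this is not the gqs or wgis computation.

```python
def intron_motif(genome_seq, start, end):
    """Classify the boundary dinucleotides of a gap-site.

    genome_seq: plus-strand reference sequence (assumed to be a plain string)
    start, end: 1-based inclusive coordinates of the gap (putative intron)
    """
    left = genome_seq[start - 1:start + 1].upper()   # first two gap bases
    right = genome_seq[end - 2:end].upper()          # last two gap bases
    if (left, right) == ("GT", "AG"):
        return "GT-AG (+)"
    # Reverse complement of GT...AG as seen on the plus strand
    if (left, right) == ("CT", "AC"):
        return "GT-AG (-)"
    return "other"


if __name__ == "__main__":
    print(intron_motif("AAAGTCCCCAGTTT", 4, 11))
    print(intron_motif("AAACTCCCCACTTT", 4, 11))
```

Counting only the first label as GT-AG would understate the share of canonical dinucleotide pairs whenever the transcribed strand is unknown.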

Highlights

  • Analysis of transcriptome sequencing data focuses on differential expression of genes, as well as alternative splicing

  • Transcript reconstruction algorithms suffer from inaccuracies [1,2] and even the identification of splice events is associated with a high false discovery rate (FDR) [3]

  • 1,083,629 gap-sites were validated by wgis in either TopHat or STAR alignments

Summary

Introduction

Analysis of transcriptome sequencing data focuses on differential expression of genes, as well as alternative splicing. Genomic alignments of transcriptome sequencing data contain alignment gaps, which are landmarks indicating potential splice-sites. Low specificity of reported alignment gaps may seriously compromise the validity of analysis results. Transcript reconstruction algorithms suffer from inaccuracies [1,2] and even the (much simpler) identification of splice events is associated with a high false discovery rate (FDR) [3]. Two approaches for validation of splicing events in transcriptome sequencing data are described and evaluated: an approach relying solely on alignment data (gqs) and a second approach which includes information from the genomic sequence (wgis).
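As an illustration of how gap-sites arise from alignments, the following minimal sketch derives gap coordinates from SAM-style CIGAR strings, where the `N` operation marks a skipped reference region. It assumes alignments are available as (chromosome, 1-based position, CIGAR) tuples; the function names `gap_sites` and `count_gap_sites` are hypothetical, and the sketch does not reproduce the paper's software or its quality scores.

```python
import re
from collections import Counter

# One CIGAR operation: a length followed by an operation code (SAM format)
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def gap_sites(pos, cigar):
    """Return (gap_start, gap_end) pairs, 1-based inclusive, for each N operation."""
    sites = []
    ref = pos  # reference position of the next base to be consumed
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "N":
            sites.append((ref, ref + length - 1))
        if op in "MDN=X":  # operations that consume reference bases
            ref += length
    return sites


def count_gap_sites(alignments):
    """Count how many alignments support each (chrom, gap_start, gap_end)."""
    counts = Counter()
    for chrom, pos, cigar in alignments:
        for start, end in gap_sites(pos, cigar):
            counts[(chrom, start, end)] += 1
    return counts


if __name__ == "__main__":
    reads = [("chr1", 100, "50M200N50M"), ("chr1", 120, "30M200N70M")]
    print(count_gap_sites(reads))
```

Both example reads support the same gap-site, so the counter reports a multiplicity of two for it. A validation score would then be computed per gap-site from such supporting alignments.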

Genomic Alignments and Gap-Sites
Intronic Genome Sequence at Gap-Sites
General Considerations on Quality Scores
Sequence Similarity between Gap-Sites and Splice-Sites
Strand Information in wgis
Annotation of Gap-Sites
Alignments from TopHat Aligner
Alignments from STAR Aligner
Comparison of Alignment Numbers
Distribution of gqs
Number of gqs-Validated Gap-Sites
Distribution of IDIN on gqs-Validated Gap-Sites
Validation of wgis
Definition of GQL Limits
Literature
Sequence Logos of Validated Gap-Sites
Global Statistics
Relation between Score Value and Gap-Site Multiplicity
Dependence of Gap-Site Validation on Gap-Site Coverage
Relation between gqs and wgis Validation
Unvalidated Gap-Sites
Maximal Alignment Coverage on Unvalidated Gap-Sites
Performance of Quality Scores
Sensitivity and FDR
Distribution of Gapped Alignments
Validation Strategies
Limitations
Fibroblast Samples
Software
Statistical Evaluation
Conclusions