Abstract

Identifying variants from RNA-seq (transcriptome sequencing) data is a cost-effective and versatile alternative to whole-genome sequencing. However, current variant callers generally perform poorly on RNA-seq data because reads span intronic regions. We have developed a software programme called Opossum to address this problem. Opossum pre-processes RNA-seq reads prior to variant calling, and although it has been designed to work specifically with Platypus, it can be used equally well with other variant callers such as GATK HaplotypeCaller. In this work, we show that using Opossum in conjunction with either Platypus or GATK HaplotypeCaller maintains precision and improves the sensitivity for SNP detection compared to the GATK Best Practices pipeline. In addition, using it in combination with Platypus offers a substantial reduction in run times compared to the GATK pipeline, making it well suited to settings where time or computational resources are limited.

Highlights

  • RNA-seq[1] is routinely employed for gene expression analysis, but it can be used to identify genomic variants in expressed regions alongside whole-exome (WES) and whole-genome sequencing (WGS)

  • We show that using Opossum in conjunction with either Platypus or GATK HaplotypeCaller maintains precision and improves the sensitivity for SNP detection compared to the GATK Best Practices pipeline

  • A few pipelines for detecting SNPs in RNA-seq data have been released to address these challenges. eSNV-detect by Tang et al.[3] employs a combination of mappers to overcome systematic errors of individual aligners, followed by variant calling with Samtools and Bcftools


Summary

Introduction

RNA-seq (transcriptome sequencing)[1] is routinely employed for gene expression analysis, but it can also be used to identify genomic variants in expressed regions, alongside whole-exome (WES) and whole-genome sequencing (WGS). The haplotype-based approach used by Platypus works well on length scales of up to a few kilobases (typically up to 1.5–2 kb), but longer reads (e.g. reads mapping across large introns) would disrupt it. For this reason, Platypus should not be run directly on RNA-seq data. We have developed a software tool called Opossum[6] to process and filter RNA-seq data and make it suitable for (haplotype-based) variant calling. Because RNA-seq data contain splice junctions, reads that have been mapped across them must be split to remove the intronic parts, which would otherwise disrupt variant calling. Our approach shows promising results, maintaining high precision and improving sensitivity in detecting SNPs compared to the GATK Best Practices pipeline. We have used the strongly validated GIAB (Genome in a Bottle) dataset[10].
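The read-splitting step described above can be illustrated with a minimal sketch. It is not Opossum's actual implementation, which this text does not show. It only assumes that spliced alignments follow the SAM convention of marking a skipped intron with an `N` operation in the CIGAR string. The function name `split_at_introns` and its arguments are hypothetical, chosen for illustration.

```python
import re
from typing import List, Tuple

# SAM CIGAR operations: a run length followed by a one-letter operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def split_at_introns(pos: int, cigar: str, seq: str) -> List[Tuple[int, str, str]]:
    """Split one spliced alignment into intron-free segments.

    Returns a list of (reference_start, cigar, sequence) tuples, one per
    exonic block. `pos` is the reference start of the first aligned base,
    in whatever coordinate system the caller uses. Hypothetical helper,
    not part of Opossum or any library.
    """
    segments = []
    ref = pos          # current reference coordinate
    query = 0          # current offset into the read sequence
    seg_start = pos    # reference start of the segment being built
    seg_query = 0      # read offset where the current segment began
    seg_ops = []       # CIGAR operations collected for the current segment

    for length, op in CIGAR_RE.findall(cigar):
        n = int(length)
        if op == "N":
            # Intron: close the current segment and skip the reference gap.
            if seg_ops:
                segments.append((seg_start, "".join(seg_ops), seq[seg_query:query]))
            ref += n
            seg_start, seg_query, seg_ops = ref, query, []
            continue
        seg_ops.append(f"{n}{op}")
        if op in "M=X":      # consumes both reference and read
            ref += n
            query += n
        elif op in "IS":     # consumes read only
            query += n
        elif op == "D":      # consumes reference only
            ref += n
        # H and P consume neither reference nor read

    if seg_ops:
        segments.append((seg_start, "".join(seg_ops), seq[seg_query:query]))
    return segments


if __name__ == "__main__":
    # A 10-base read spanning a 100-base intron becomes two 5-base segments.
    for segment in split_at_introns(100, "5M100N5M", "AAAAACCCCC"):
        print(segment)
```

Each returned segment lies within a single exonic block, so a haplotype-based caller working on short length scales never sees a read stretched across an intron. A real pre-processing tool would also need to carry over base qualities, flags and mapping quality, which this sketch omits.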

Methods
Results
Oikkonen LE
