Abstract

Background: A genomic signal track is a set of genomic intervals associated with values of various types, such as measurements from high-throughput experiments. Analysis of signal tracks requires complex computational methods, which often make the analysts focus too much on the detailed computational steps rather than on their biological questions.

Results: Here we propose Signal Track Query Language (STQL) for simple analysis of signal tracks. It is a Structured Query Language (SQL)-like declarative language, which means one only specifies what computations need to be done but not how these computations are to be carried out. STQL provides a rich set of constructs for manipulating genomic intervals and their values. To run STQL queries, we have developed the Signal Track Analytical Research Tool (START, http://yiplab.cse.cuhk.edu.hk/start/), a system that includes a Web-based user interface and a back-end execution system. The user interface helps users select data from our database of around 10,000 commonly used public signal tracks, manage their own tracks, and construct, store and share STQL queries. The back-end system automatically translates STQL queries into optimized low-level programs and runs them on a computer cluster in parallel. We use STQL to perform 14 representative analytical tasks. By repeating these analyses using bedtools, Galaxy and custom Python scripts, we show that the STQL solution is usually the simplest, and the parallel execution achieves significant speed-up with large data files. Finally, we describe how a biologist with minimal formal training in computer programming self-learned STQL to analyze DNA methylation data we produced from 60 pairs of hepatocellular carcinoma (HCC) samples.

Conclusions: Overall, STQL and START provide a generic way for analyzing a large number of genomic signal tracks in parallel easily.

Highlights

  • Comparison with other approaches: To evaluate the simplicity of Signal Track Query Language (STQL) and the correctness and efficiency of Signal Track Analytical Research Tool (START) in executing STQL queries, we compared STQL with three other approaches on the same analysis tasks.

  • In this paper, we have described the Signal Track Query Language (STQL), a Structured Query Language (SQL)-like declarative language that allows users to perform a variety of analyses by specifying only the analysis goals rather than all the computational details.

Summary

Introduction

Zhu et al. BMC Genomics (2017) 18:749

A genomic signal track is a set of genomic intervals associated with values of various types, such as measurements from high-throughput experiments. When data from a ChIP-seq (chromatin immunoprecipitation followed by high-throughput sequencing) experiment are represented as a signal track, at the most basic level each interval corresponds to a single genomic location, and the associated value is the number of aligned reads that cover the location. Alternatively, one could use a gene annotation set to define intervals of interest (e.g., promoters) and compute the average number of covering reads at each interval as its signal value. In each of these cases, the ChIP-seq data are represented by a signal track. The generality of representing high-throughput sequencing data by signal tracks is exemplified by its prevalent use in genome browsers for displaying many types of sequencing data.
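The two representations described above can be illustrated with a minimal Python sketch. It is not STQL and not part of START; the function names, the 0-based half-open coordinate convention, and the toy reads are all assumptions made for illustration. It first builds the basic per-location track (read coverage at each position) and then averages that coverage over an interval of interest such as a promoter.

```python
from typing import List, Tuple

# Hypothetical representation: (chromosome, start, end), 0-based, half-open.
Interval = Tuple[str, int, int]


def per_base_coverage(reads: List[Interval], region: Interval) -> List[int]:
    """Count, for each position in `region`, how many reads cover it."""
    chrom, start, end = region
    coverage = [0] * (end - start)
    for r_chrom, r_start, r_end in reads:
        if r_chrom != chrom:
            continue
        # Clip the read to the region; an empty range means no overlap.
        for pos in range(max(r_start, start), min(r_end, end)):
            coverage[pos - start] += 1
    return coverage


def mean_coverage(reads: List[Interval], region: Interval) -> float:
    """Average number of covering reads over `region` (its signal value)."""
    coverage = per_base_coverage(reads, region)
    return sum(coverage) / len(coverage) if coverage else 0.0


# Toy data (invented for this example).
reads = [("chr1", 0, 10), ("chr1", 5, 15), ("chr2", 0, 10)]
promoter = ("chr1", 8, 12)

print(per_base_coverage(reads, promoter))  # per-location signal values
print(mean_coverage(reads, promoter))      # one signal value for the interval
```

The per-position loop keeps the sketch easy to read; for genome-scale data one would normally rely on a sorted-interval sweep or a dedicated tool instead.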
