Abstract

High-throughput technologies generate considerable amounts of data, which often require bioinformatic expertise to analyze. Here we present High-Throughput Tabular Data Processor (HTDP), a platform-independent Java program. HTDP works on any character-delimited column data (e.g. BED, GFF, GTF, PSL, WIG, VCF) from multiple text files and supports merging, filtering and converting of data produced in the course of high-throughput experiments. HTDP can also utilize itemized sets of conditions from external files for complex or repetitive filtering/merging tasks. The program is intended to aid global, real-time processing of large data sets using a graphical user interface (GUI). Therefore, no prior expertise in programming, regular expressions, or command line usage is required of the user. Additionally, no a priori assumptions are imposed on the internal file composition. We demonstrate the flexibility and potential of HTDP in real-life research tasks involving microarrays and massively parallel sequencing, i.e. identification of disease-predisposing variants in next-generation sequencing data as well as comprehensive concurrent analysis of microarray and sequencing results. We also show the utility of HTDP in technical tasks including data merging, reduction and filtering with external criteria files. HTDP was developed to address functionality that is missing or rudimentary in other GUI software for processing character-delimited column data from high-throughput technologies. Its flexibility in input file handling provides long-term potential functionality in high-throughput analysis pipelines, as the program is not limited by currently existing applications and data formats. HTDP is available as open-source software (https://github.com/pmadanecki/htdp).

Highlights

  • High-throughput technologies, e.g. microarrays and massively parallel sequencing, have become standard tools in genetics

  • High-Throughput Tabular Data Processor (HTDP) was created to meet the demand for efficient graphical user interface (GUI) based processing of high-throughput data, with flexible support for multiple files and character-delimited formats (Table 1)

  • UCSC Genome Browser tracks can be imported and used as filtering criteria, while the resulting files can be exported from HTDP in one of the formats used for custom tracks


Introduction

High-throughput technologies, e.g. microarrays and massively parallel sequencing, have become standard tools in genetics. Several formats based on tabular text files with delimiters (e.g. BED, GFF, GTF, WIG, VCF) are widely used for exchange and storage of microarray data (http://www.sanger.ac.uk/resources/software/gff/spec.html; http://mblab.wustl.edu/GTF22.html; http://genome.ucsc.edu/FAQ/FAQformat.html) [1]. The non-standard features are often ignored by other programs and constitute excess information, which creates an unnecessary burden in terms of data processing and storage. To alleviate these problems, several solutions can be used, including standard or specialized Unix command line tools (e.g. grep, VCF-tools, BED-tools) [2,4], custom programming (e.g. Perl, Python scripts) [5,6], commercial software solutions and, in some cases, office spreadsheet software [2,4,7].
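To make the custom-programming route concrete, the following is a minimal Python sketch of the kind of script a user would otherwise have to write: it keeps only those rows of a tab-delimited file whose value in a chosen column appears in an external criteria list. It is an illustration only and does not represent how HTDP is implemented; the function name `filter_rows`, the sample BED-like rows and the gene names are all hypothetical.

```python
import csv
import io


def filter_rows(table, criteria, column, delimiter="\t"):
    """Return rows whose value in `column` (0-based) is listed in `criteria`.

    `table` and `criteria` are any iterables of text lines, e.g. open files.
    """
    # One accepted value per line of the criteria file; blank lines are ignored.
    wanted = {line.strip() for line in criteria if line.strip()}
    reader = csv.reader(table, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    kept = []
    for row in reader:
        # Skip blank lines and comment lines starting with "#".
        if not row or row[0].startswith("#"):
            continue
        # Rows too short to contain the column cannot match.
        if len(row) > column and row[column] in wanted:
            kept.append(row)
    return kept


# Hypothetical BED-like data and criteria, held in memory for the example.
bed = io.StringIO(
    "# comment line\n"
    "chr1\t100\t200\tgeneA\n"
    "chr2\t300\t400\tgeneB\n"
    "chr1\t500\t600\tgeneC\n"
)
criteria = io.StringIO("geneA\ngeneC\n")

rows = filter_rows(bed, criteria, column=3)
for row in rows:
    print("\t".join(row))
```

Even this small task requires choices about delimiters, comment lines and column indexing, which is the kind of overhead a GUI-based tool is meant to remove.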
