Abstract

Sequence file formats (FASTA and FASTQ) are commonly used in bioinformatics, molecular biology and biochemistry. With the advent of next-generation sequencing (NGS) technologies, the number of FASTQ datasets produced and analyzed has grown exponentially, prompting the development of dedicated software to handle, parse, and manipulate such files efficiently. Several bioinformatics packages are available to filter and manipulate FASTA and FASTQ files, yet some essential tasks remain poorly supported, leaving gaps that any NGS analysis workflow must fill with custom scripts. This can introduce harmful variability and performance bottlenecks in pivotal steps. Here we present a suite of tools, called SeqFu (Sequence Fastx utilities), that provides a broad range of commands to perform both common and specialist operations with ease and is designed to be easily integrated into high-performance analytical pipelines. SeqFu includes high-performance implementations of algorithms to interleave and deinterleave FASTQ files, merge Illumina lanes, and perform various quality controls (identification of degenerate primers, analysis of length statistics, extraction of portions of the datasets). SeqFu also dereplicates sequences from multiple files while keeping track of their provenance. SeqFu is developed in Nim for high-performance processing, is freely available, and can be installed with the popular package manager Miniconda.

Highlights

  • SeqFu is written in Nim, a high-performance compiled language, and was tested with three compiler versions (1.0, 1.2, and 1.4). It implements the FASTA/FASTQ parsing algorithm written by Heng Li [7], which is available from the repository https://github.com/

  • The FASTQ/FASTA parsing library we adopted accepts both FASTA and FASTQ files as input, either uncompressed or gzip-compressed, and supports the less common ‘Sanger FASTQ’ format, in which a single sequence can span multiple lines.
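
To illustrate what such a parser has to handle, the following is a minimal Python sketch of a reader that accepts FASTA and FASTQ records, including multi-line sequences and qualities, from plain or gzip-compressed files. It is not the SeqFu code nor the parser by Heng Li [7]; the function names `open_maybe_gzip` and `read_fastx` are hypothetical and chosen here only for illustration.

```python
import gzip


def open_maybe_gzip(path):
    """Open a text file, decompressing it if it starts with the gzip magic bytes.

    Hypothetical helper for this sketch, not part of SeqFu.
    """
    with open(path, "rb") as handle:
        magic = handle.read(2)
    if magic == b"\x1f\x8b":
        return gzip.open(path, "rt")
    return open(path, "rt")


def read_fastx(lines):
    """Yield (name, comment, sequence, quality) tuples from FASTA/FASTQ lines.

    quality is None for FASTA records. Sequence and quality may each span
    several lines; quality is read by length, so a quality line that begins
    with '@' is not mistaken for a new header.
    """
    lines = iter(lines)
    header = None
    while True:
        if header is None:
            # Skip ahead to the next header line ('>' for FASTA, '@' for FASTQ).
            for line in lines:
                if line[:1] in (">", "@"):
                    header = line.rstrip()
                    break
            else:
                return  # input exhausted
        name, _, comment = header[1:].partition(" ")
        header = None
        seq_parts = []
        plus_seen = False
        for line in lines:
            line = line.rstrip()
            if line[:1] in (">", "@"):
                header = line  # start of the next record
                break
            if line[:1] == "+":
                plus_seen = True  # FASTQ separator: quality follows
                break
            seq_parts.append(line)
        seq = "".join(seq_parts)
        if not plus_seen:
            yield name, comment, seq, None
            if header is None:
                return  # input exhausted after the last FASTA record
            continue
        # Collect quality lines until they cover the whole sequence.
        qual_parts = []
        qual_len = 0
        for line in lines:
            line = line.rstrip()
            qual_parts.append(line)
            qual_len += len(line)
            if qual_len >= len(seq):
                break
        qual = "".join(qual_parts)
        if len(qual) != len(seq):
            raise ValueError(f"record {name!r}: sequence and quality lengths differ")
        yield name, comment, seq, qual
```

A typical call under these assumptions would be `for name, comment, seq, qual in read_fastx(open_maybe_gzip("reads.fastq.gz")): ...`, where the file name is a placeholder. Reading the quality string by length rather than by line count is what allows the multi-line ‘Sanger FASTQ’ layout to be parsed unambiguously.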

Introduction

The FASTA format was introduced in 1985 with the software package of the same name developed by Lipman and Pearson [1]. It is still the de facto standard format for nucleotide and protein sequences. With the advent of automatic capillary sequencing, the FASTQ format was introduced to store a quality score for each base [1,2]. These two file formats are ubiquitous in bioinformatics, and a broad set of utilities has been released over the years to help users access and manipulate the sequences.

