FASTAFS: file system virtualisation of random access compressed FASTA files

Youri Hoogstrate, Harmen J G Van De Werken, Guido W Jenster

doi:10.1186/s12859-021-04455-3


Open Access

https://doi.org/10.1186/s12859-021-04455-3


Journal: BMC Bioinformatics	Publication Date: Nov 1, 2021
Citations: 4	License type: open-access

Affiliation: Erasmus MC, Erasmus MC Cancer Institute

Abstract

Background: The FASTA file format, used to store polymeric sequence data, has been a standard bioinformatics file format for decades. The relatively large files require additional files, beyond the scope of the original format, to identify sequences and to provide random access. Multiple compressors have been developed to archive FASTA files back and forth, but these lack direct access to targeted content or metadata of the archive. Moreover, these solutions are not directly backwards compatible with FASTA files, resulting in limited software integration.

Results: We designed a Linux-based toolkit that virtualises the content of DNA, RNA and protein FASTA archives into the filesystem by using Filesystem in Userspace (FUSE). This guarantees in-sync virtualised metadata files and offers fast random-access decompression using bit encodings plus Zstandard (zstd). The toolkit, FASTAFS, can track all its system-wide running instances, allows file integrity verification, provides instant, scriptable access to sequence files, and is easy to use and deploy. The file compression ratios were comparable but not superior to other state-of-the-art archival tools, despite the innovative random access feature implemented in FASTAFS.

Conclusions: FASTAFS is a user-friendly, easy-to-deploy, backwards compatible, generic purpose solution to store and access compressed FASTA files, since it offers file system access to FASTA files as well as in-sync metadata files through file virtualisation. Using virtual filesystems as an in-between layer offers format conversion without the need to rewrite code into different programming languages while preserving compatibility.

Highlights

The FASTA file format, used to store polymeric sequence data, has been a standard bioinformatics file format for decades.
Static information is embedded within each file, but needs to be extracted and stored in additional files to complement the FASTA file.
Previous methods have focused on the most efficient compression possible, but not on backwards compatibility, interoperability, random access and inclusion of metadata. This is the most probable explanation for why gzip, a generic purpose compression method that is suboptimal for this data type, is the most commonly integrated archive type in bioinformatics applications that use FASTA as input.

Summary

Introduction

The FASTA file format, used to store polymeric sequence data, has been a standard bioinformatics file format for decades. Multiple compressors have been developed to archive FASTA files back and forth, but these lack direct access to targeted content or metadata of the archive. These solutions are not directly backwards compatible with FASTA files, resulting in limited software integration. FASTA is a file format used for storing nucleotide and amino acid polymeric sequences and is compatible with a wide variety of bioinformatics software. It is used as a database for ribosomal RNA sequences, for eukaryotic reference genomes and for protein databases, which can be several gigabytes in size. In the CRAM data format, Next Generation Sequencing (NGS) alignments are compressed relative to a reference sequence. In this format, these reference sequences are addressed using their unique identifier for interoperability. Dict-files are, like fai-index files, beyond the scope of the original file format and have to be generated and maintained after obtaining the FASTA file.
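To illustrate why a separate index file is needed for random access to a plain FASTA file, the following minimal sketch builds records in the style of a fai-index (name, sequence length, byte offset, bases per line, bytes per line) and uses them to fetch a subsequence without scanning the whole file. It is not the FASTAFS implementation: the function names `build_fai` and `fetch` are hypothetical, the FASTA content is held in memory for brevity, and the input is assumed to be well-formed with uniform line lengths within each sequence.

```python
import io


def build_fai(fasta_bytes: bytes):
    """Return fai-style records: (name, length, offset, linebases, linewidth)."""
    records = []
    pos = 0          # byte position of the current line in the file
    name = None
    length = offset = linebases = linewidth = 0
    for line in io.BytesIO(fasta_bytes):
        stripped = line.rstrip(b"\r\n")
        if line.startswith(b">"):
            if name is not None:
                records.append((name, length, offset, linebases, linewidth))
            # Sequence name is the header text up to the first whitespace.
            fields = stripped[1:].split()
            name = fields[0].decode() if fields else ""
            length, offset = 0, pos + len(line)  # sequence starts after header
            linebases = linewidth = 0
        elif name is not None and stripped:
            if linebases == 0:
                # The first sequence line defines the line layout.
                linebases, linewidth = len(stripped), len(line)
            length += len(stripped)
        pos += len(line)
    if name is not None:
        records.append((name, length, offset, linebases, linewidth))
    return records


def fetch(fasta_bytes: bytes, record, start: int, end: int) -> str:
    """Return bases [start, end) (0-based) of one sequence using only its index record."""
    _name, length, offset, linebases, linewidth = record
    end = min(end, length)
    if start >= end:
        return ""
    # Translate base coordinates to byte coordinates, skipping line terminators.
    first = offset + (start // linebases) * linewidth + start % linebases
    last = offset + ((end - 1) // linebases) * linewidth + (end - 1) % linebases + 1
    chunk = fasta_bytes[first:last]
    return chunk.replace(b"\n", b"").replace(b"\r", b"").decode()
```

Because the index is derived from the FASTA file, it becomes stale whenever the file changes, which is the maintenance burden that virtualised, in-sync metadata files are meant to avoid.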

