Abstract
The large size and high complexity of biological data can represent a major methodological challenge for the analysis and exchange of data sets between computers and applications. There has also been a substantial increase in the amount of metadata associated with biological data sets, which is increasingly being incorporated into existing data formats. Despite the existence of structured formats based on XML, biological data sets are mainly stored in unstructured file formats, and the incorporation of metadata leads to increasingly complex, and therefore more error-prone, parsing routines. To overcome these problems, we present the “biological object notation” (BON) format, a new way to exchange and parse nearly all biological data sets more efficiently and with fewer errors than other currently available formats. Based on JavaScript Object Notation (JSON), BON simplifies parsing by clearly separating the biological data from its metadata, and it reduces complexity compared to XML-based formats. The ability to selectively compress data by up to 87% compared to other file formats, together with the reduced complexity, results in improved transfer times and less error-prone applications.
Highlights
Biological data, which include, but are not limited to, molecular sequences, annotations and phylogenetic trees, are still predominantly exchanged as flat files or in line-based formats despite the existence of more structured file notations that are better suited to complex data
NCBI’s Entrez utility or “Representational state transfer” (REST) application programming interfaces (APIs) only export biological data in FASTA and XML formats; other information is available in JSON (ref. 9)
To demonstrate the versatility of “biological object notation” (BON), we designed a method, based on NeXML (ref. 6), to encode phylogenetic trees using the JavaScript Object Notation (JSON) syntax, which allows the addition of arbitrary metadata (Fig. 3e; Supplementary Table 5)
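As a rough illustration of the last highlight, the sketch below shows one way a phylogenetic tree could be written as a JSON object with node and edge lists (loosely following the node/edge model used by NeXML) and converted to Newick. The field names (`metadata`, `nodes`, `edges`, `root`, `length`) and the helper `to_newick` are assumptions made for this example; they are not taken from the BON specification.

```python
import json

# Hypothetical tree document: field names are illustrative, not the BON schema.
tree_json = """
{
  "metadata": {"method": "hypothetical example"},
  "nodes": [
    {"id": "n1", "label": null},
    {"id": "n2", "label": "A"},
    {"id": "n3", "label": "B"}
  ],
  "edges": [
    {"source": "n1", "target": "n2", "length": 0.1},
    {"source": "n1", "target": "n3", "length": 0.2}
  ],
  "root": "n1"
}
"""


def to_newick(tree):
    """Convert the hypothetical node/edge tree into a Newick string."""
    labels = {node["id"]: node["label"] or "" for node in tree["nodes"]}

    # Map each parent id to its (child id, branch length) pairs.
    children = {}
    for edge in tree["edges"]:
        children.setdefault(edge["source"], []).append(
            (edge["target"], edge["length"])
        )

    def render(node_id):
        kids = children.get(node_id, [])
        if not kids:
            return labels[node_id]
        inner = ",".join(f"{render(child)}:{length}" for child, length in kids)
        return f"({inner}){labels[node_id]}"

    return render(tree["root"]) + ";"


tree = json.loads(tree_json)
newick = to_newick(tree)
print(newick)
```

Because the metadata sits under its own key, a consumer that only needs the topology can ignore it without any extra parsing logic.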
Summary
Biological data, which include, but are not limited to, molecular sequences, annotations and phylogenetic trees, are still predominantly exchanged as flat files or in line-based formats despite the existence of more structured file notations that are better suited to complex data. These structures can describe virtually all biological data sets while retaining low parsing complexity, for example because additional checks like those for attribute values in XML tags are omitted. The TinySeq and uncompressed BON files in the Genome and Collection data sets were almost identical in size, with the exception of the Plant EST subset, which was ~18% smaller in BON. The compressed BON files were between 43% and 70% smaller in the nucleotide sequence data sets (Fig. 3a,b; Supplementary Table 2).
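The idea of keeping data and metadata apart while compressing only the bulky part can be sketched as follows. This is a minimal example under stated assumptions: the record layout (`metadata`, `data`, `encoding`, `sequence`), the helper names `make_record` and `read_sequence`, and the choice of gzip with Base64 are all hypothetical and are not taken from the BON specification, which may use a different compression scheme.

```python
import base64
import gzip
import json


def make_record(seq_id, sequence, metadata, compress=False):
    """Build a hypothetical record that keeps data and metadata separate."""
    if compress:
        # Compress only the sequence; Base64 keeps the result valid JSON text.
        raw = gzip.compress(sequence.encode("ascii"))
        payload = base64.b64encode(raw).decode("ascii")
        encoding = "gzip+base64"
    else:
        payload = sequence
        encoding = "plain"
    return {
        "metadata": metadata,
        "data": {"id": seq_id, "encoding": encoding, "sequence": payload},
    }


def read_sequence(record):
    """Return the sequence, decompressing it when the record says so."""
    data = record["data"]
    if data["encoding"] == "gzip+base64":
        return gzip.decompress(base64.b64decode(data["sequence"])).decode("ascii")
    return data["sequence"]


# Illustrative input only; not data from the paper.
sequence = "ACGT" * 500
record = make_record(
    "seq1", sequence, {"organism": "example", "source": "hypothetical"}, compress=True
)
text = json.dumps(record)
restored = read_sequence(json.loads(text))
print(len(sequence), len(text), restored == sequence)
```

The metadata stays readable as ordinary JSON even when the sequence is compressed, so an application can filter records by metadata without decompressing anything. The size reduction seen here depends entirely on the repetitive toy sequence and says nothing about the figures reported above.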