Abstract
3D macromolecular structural data is growing ever more complex and plentiful in the wake of substantive advances in experimental and computational structure determination methods, including macromolecular crystallography, cryo-electron microscopy, and integrative methods. Efficient means of working with 3D macromolecular structural data for archiving, analysis, and visualization are central to facilitating interoperability and reusability in compliance with the FAIR Principles. We address two challenges posed by growth in data size and complexity. First, data size is reduced by bespoke compression techniques. Second, complexity is managed through improved software tooling and by fully leveraging available data dictionary schemas. To this end, we introduce BinaryCIF, a serialization of Crystallographic Information File (CIF) format files that maintains full compatibility with related data schemas, such as PDBx/mmCIF, while reducing file sizes by more than a factor of two versus gzip-compressed CIF files. Moreover, for the largest structures, BinaryCIF provides even better compression: a factor of ten versus CIF files and a factor of four versus gzipped CIF files. We also describe CIFTools, a set of libraries in Java and TypeScript for generic and typed handling of CIF and BinaryCIF files. Together, BinaryCIF and CIFTools enable lightweight, efficient, and extensible handling of 3D macromolecular structural data.
Highlights
Structural biologists routinely use macromolecular crystallography (MX) and three-dimensional (3D) electron microscopy (3DEM) to produce atomic-level structural models of large biomolecular machines and deposit them in the single global archive of macromolecular structure data, known as the Protein Data Bank (PDB) [1].
BinaryCIF and mmCIF files were annotated with the chem_comp_bond category using Mol* [18].
The original version of the archive in BinaryCIF benefits greatly from gzip compression because the encoding strategy employed for each column is described by fairly verbose strings (e.g., StringArray) that compress efficiently.
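To illustrate what a per-column encoding strategy can look like, the following TypeScript sketch chains a delta step and a run-length step on an integer column. It is a minimal illustration under stated assumptions, not the CIFTools API: the function names and the sample column are hypothetical, and the actual BinaryCIF encoders may differ in detail.

```typescript
// Hypothetical helper names for illustration only; this is not the CIFTools API.

/** Delta encoding: keep the first value, then store successive differences. */
function deltaEncode(values: number[]): number[] {
    const out: number[] = [];
    for (let i = 0; i < values.length; i++) {
        out.push(i === 0 ? values[0] : values[i] - values[i - 1]);
    }
    return out;
}

/** Inverse of deltaEncode: a running sum restores the original values. */
function deltaDecode(deltas: number[]): number[] {
    const out: number[] = [];
    let acc = 0;
    for (const d of deltas) {
        acc += d;
        out.push(acc);
    }
    return out;
}

/** Run-length encoding as a flat list of [value, count, value, count, ...]. */
function runLengthEncode(values: number[]): number[] {
    const out: number[] = [];
    for (const v of values) {
        const n = out.length;
        if (n >= 2 && out[n - 2] === v) {
            out[n - 1]++;      // extend the current run
        } else {
            out.push(v, 1);    // start a new run
        }
    }
    return out;
}

/** Inverse of runLengthEncode. */
function runLengthDecode(pairs: number[]): number[] {
    const out: number[] = [];
    for (let i = 0; i + 1 < pairs.length; i += 2) {
        for (let k = 0; k < pairs[i + 1]; k++) {
            out.push(pairs[i]);
        }
    }
    return out;
}

// Assumed sample column: consecutive atom serial numbers.
const atomIds = [101, 102, 103, 104, 105, 106, 107, 108];

const encoded = runLengthEncode(deltaEncode(atomIds)); // [101, 1, 1, 7]
const decoded = deltaDecode(runLengthDecode(encoded)); // same as atomIds

console.log(encoded, decoded);
```

In this toy column, eight integers shrink to four because the differences between consecutive values are constant. A general-purpose compressor such as gzip can then be applied on top of the encoded data and its textual descriptors.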
Summary
Structural biologists routinely use macromolecular crystallography (MX) and three-dimensional (3D) electron microscopy (3DEM) to produce atomic-level structural models (hereafter structures) of large biomolecular machines and deposit them in the single global archive of macromolecular structure data, known as the Protein Data Bank (PDB) [1]. Even larger and more complex 3D structures, such as the Nuclear Pore Complex (PDBDEV_00000012 [2]), are coming from integrative (or hybrid) methods (IM) [3] that use multiple, complementary experimental and computational methods. These evolving and emerging structure determination methods require many new data items to (i) describe the state of the macromolecular system, (ii) reflect the provenance, complexity, and quality of the underlying experimental data, (iii) enumerate the computational procedure(s) used for 3D structure modeling, and (iv) provide assessments of the validity of the structural model versus chemical reference and experimental data. Larger molecular assemblies that are not resolvable at the atomic level require new multi-scale descriptions (e.g., coarse-grained beads representing single amino acid residues, or irregular polygons representing protein domains or entire polypeptide chains).