To the Editor: The call for data access standards in mass spectrometry-based proteomics has led to proposals focused on the extraction of native data to XML-based formats [1,2]. While self-describing and human-readable formats represent laudable goals, particularly for archival purposes, they are not well suited to large numeric datasets. Consequently, while metadata in mzML [2] remain human-readable, the vast majority of the file is devoted to a hexadecimal representation of the mass spectra. Moreover, the transition from mzXML to a true XML format (mzML [2]) eliminates embedded indexing schemes; as a result, extracted files are compromised in both content and access efficiency [1,3].

Based on similarities in data structure and access patterns, we suggest that fields such as astronomy are better models for proteomics data analysis (Figure 1). These fields also rely on common formats, but typically use binary standards such as HDF5 [4] or netCDF [5]. By contrast, the commercial nature of mass spectrometry has led to the evolution of proprietary binary file formats. In light of these observations, we propose that a common and redistributable application programming interface (API) represents a more viable approach to data access in mass spectrometry. In effect, we propose to shift the burden of standards compliance to the manufacturers' existing data access libraries.

Figure 1. Array Scanners, Telescopes, and Mass Spectrometers: XML, HDF, or API?

While our proposal for abstraction through a common API represents a clear departure from current attempts to define a universal file format, it is by no means unique within the broader scientific community. For example, standardized APIs have enabled the development of portable applications in such diverse areas as computer graphics (OpenGL [7]) and parallel processing (Message Passing Interface, MPI [8]). More importantly, we believe that a common API will benefit all stakeholders.
For example, free redistribution of compiled, vendor-supplied dynamically linked libraries (DLLs) will protect the proprietary layout of native files and provide users with direct and flexible access to data system- and instrument-specific functionality, which is typically ignored by lowest-common-denominator export routines. In addition, we note that mzAPI naturally supports the FDA's 21 CFR Part 11 regulatory requirements for electronic records [9]. Finally, a community-driven API standard will facilitate manufacturer support of UNIX, in addition to Windows, by highlighting the subset of procedures from each data system (Xcalibur™, Analyst™, etc.) that need to be ported.

Motivated originally by our desire to provide a more intimate environment for flexible and in-depth exploration of mass spectrometry data, particularly from low-throughput experiments, we developed a preliminary common API (mzAPI) consisting of just five procedures (http://blais.dfci.harvard.edu/mzAPI). To maximize accessibility, we exposed mzAPI in the form of a Python library within a flexible, mass-informatics desktop framework called multiplierz (http://blais.dfci.harvard.edu/multiplierz). We are encouraged by results from this test harness, in particular by how well mzAPI and our desktop environment support a variety of data analytic operations. Equally impressive is how quickly non-programmers can customize scripts for their specific tasks. Despite success to date in our own lab, we recognize that mzAPI will benefit from further refinement and stress testing. Accordingly, we are actively seeking input from the research community with respect to both the concept and the implementation of a comprehensive API-based standard for mass spectrometry data access and analysis.
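To illustrate the general idea of abstraction through a common API, the following is a minimal Python sketch. It is not the mzAPI interface: the five mzAPI procedures are not listed in this letter, so every name here (`SpectrumSource`, `scan_times`, `scan`, `InMemorySource`, `base_peak`) is hypothetical and chosen only for illustration. The in-memory backend stands in for what would, in practice, be a wrapper around a manufacturer's data access library.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


class SpectrumSource(ABC):
    """Hypothetical common interface; NOT the actual mzAPI procedure list."""

    @abstractmethod
    def scan_times(self) -> List[float]:
        """Return the acquisition time of every scan, in ascending order."""

    @abstractmethod
    def scan(self, time: float) -> List[Peak]:
        """Return the peaks of the scan acquired closest to `time`."""


class InMemorySource(SpectrumSource):
    """Stand-in backend. A real backend would delegate to a vendor library."""

    def __init__(self, scans: Dict[float, List[Peak]]) -> None:
        self._scans = dict(scans)

    def scan_times(self) -> List[float]:
        return sorted(self._scans)

    def scan(self, time: float) -> List[Peak]:
        # Pick the stored scan whose time is nearest to the requested time.
        nearest = min(self._scans, key=lambda t: abs(t - time))
        return list(self._scans[nearest])


def base_peak(source: SpectrumSource, time: float) -> Peak:
    """Analysis code written against the interface, not a file format."""
    return max(source.scan(time), key=lambda peak: peak[1])


# Made-up demonstration data, not real measurements.
demo = InMemorySource({
    1.0: [(400.2, 10.0), (512.3, 55.0)],
    2.0: [(400.2, 80.0), (733.4, 20.0)],
})
print(base_peak(demo, 1.9))
```

Under this arrangement, `base_peak` never touches a file layout; supporting another data system would mean supplying another `SpectrumSource` subclass, while the analysis code stays unchanged.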