A Mass Spectrometry Proteomics Data Management Platform

Michael Riffle,Vagisha Sharma,Michael J. MacCoss,Jimmy K. Eng

doi:10.1074/mcp.o111.015149

Abstract

Mass spectrometry-based proteomics is increasingly being used in biomedical research. These experiments typically generate a large volume of highly complex data, and the volume and complexity are only increasing with time. There exist many software pipelines for analyzing these data (each typically with its own file formats), and as technology improves, these file formats change and new formats are developed. Files produced from these myriad software programs may accumulate on hard disks or tape drives over time, with older files being rendered progressively more obsolete and unusable with each successive technical advancement and data format change. Although initiatives exist to standardize the file formats used in proteomics, they do not address the core failings of a file-based data management system: (1) files are typically poorly annotated experimentally, (2) files are "organically" distributed across laboratory file systems in an ad hoc manner, (3) file formats become obsolete, and (4) searching the data and comparing and contrasting results across separate experiments is very inefficient (if possible at all). Here we present a relational database architecture and accompanying web application dubbed Mass Spectrometry Data Platform that is designed to address the failings of the file-based mass spectrometry data management approach. The database is designed such that the output of disparate software pipelines may be imported into a core set of unified tables, with these core tables being extended to support data generated by specific pipelines. Because the data are unified, they may be queried, viewed, and compared across multiple experiments using a common web interface. Mass Spectrometry Data Platform is open source and freely available at http://code.google.com/p/msdapl/.

Highlights

From the ‡Department of Genome Sciences, University of Washington, Seattle, Washington 98195; §Department of Biochemistry, University of Washington, Seattle, Washington 98195
In an effort to improve data portability and address this issue of many disparate proprietary file formats, important work has gone into the development of standardized and open data formats.
We present the Mass Spectrometry Data Platform (MSDaPl)[1], a proteomics data management system that, instead of driving proteomics workflows, focuses on long-term archiving, searching, evaluating, and performing simple analysis of the data that result from the workflows, and that may be used to complement systems such as the LabKey Server (Fig. 1).

Summary

Technological Innovation and Resources

Mass spectrometry-based proteomics is increasingly being used in biomedical research. Attempting to identify the same protein across experiments that used different FASTA files by mapping accession strings from one database to another (or even between multiple versions of the same database) is an inherently unreliable process (assuming the user even used a FASTA file supported by the mapping). To address this problem, when data are uploaded to MSDaPl, these accession strings are mapped to protein identification numbers in the NR_SEQ database by looking up the accession string in the protein reference table for the respective FASTA file.

Software Architecture: The software developed for MSDaPl comprises a web application running on top of the databases described above, a back end job queue and data importers designed for uploading MS/MS results to the database, and a FASTA parsing program designed to map FASTA headers to protein identifiers in the database. All software, including source code, is available at the MSDaPl download site at http://code.google.com/p/msdapl/.
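The accession lookup described above, in which an accession string is resolved to a protein identifier through the reference table for its FASTA file, can be illustrated with a minimal sketch. It uses an in-memory SQLite database rather than the actual MSDaPl or NR_SEQ schema, which is not reproduced here. The table names (`fasta_file`, `protein_reference`), column names, file names, accession strings, and helper functions are all hypothetical and chosen only for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a toy database. The schema is an assumption, not the NR_SEQ schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE fasta_file (
            id   INTEGER PRIMARY KEY,
            name TEXT UNIQUE NOT NULL
        );
        -- One row per (FASTA file, accession string), pointing at a stable protein id.
        CREATE TABLE protein_reference (
            fasta_file_id INTEGER NOT NULL REFERENCES fasta_file(id),
            accession     TEXT    NOT NULL,
            protein_id    INTEGER NOT NULL,
            PRIMARY KEY (fasta_file_id, accession)
        );
        """
    )
    # Hypothetical FASTA files that name the same protein differently.
    conn.executemany(
        "INSERT INTO fasta_file (id, name) VALUES (?, ?)",
        [(1, "demo_v1.fasta"), (2, "demo_v2.fasta")],
    )
    conn.executemany(
        "INSERT INTO protein_reference (fasta_file_id, accession, protein_id) "
        "VALUES (?, ?, ?)",
        [
            (1, "PROT_A_OLD", 101),
            (2, "PROT_A_NEW", 101),  # same protein, different accession string
            (2, "PROT_B_NEW", 102),
        ],
    )
    return conn


def map_accession(conn, fasta_name, accession):
    """Return the protein id for an accession within one FASTA file, or None."""
    row = conn.execute(
        """
        SELECT pr.protein_id
        FROM protein_reference AS pr
        JOIN fasta_file AS f ON f.id = pr.fasta_file_id
        WHERE f.name = ? AND pr.accession = ?
        """,
        (fasta_name, accession),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    conn = build_demo_db()
    old_id = map_accession(conn, "demo_v1.fasta", "PROT_A_OLD")
    new_id = map_accession(conn, "demo_v2.fasta", "PROT_A_NEW")
    print(old_id == new_id)  # both accessions resolve to the same protein id
```

Because the lookup is scoped to the FASTA file used in the search, two experiments that refer to the same protein by different accession strings resolve to one identifier, and an accession absent from that file's reference table returns no match instead of a guessed mapping.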

MSDaPl Web Application

CONCLUSION