Statistics in Proteomics: A Meta-analysis of 100 Proteomics Papers Published in 2019.

David C L Handler,Paul A Haynes

doi:10.1021/jasms.9b00142

Abstract

We randomly selected 100 journal articles published in five proteomics journals in 2019 and manually examined each of them against a set of 13 criteria concerning the statistical analyses used, all of which were based on items mentioned in the journals' instructions to authors. This included questions such as whether a pilot study was conducted and whether false discovery rate calculation was employed at either the quantitation or identification stage. These data were then transformed to binary inputs, analyzed via machine learning algorithms, and classified accordingly, with the aim of determining if clusters of data existed for specific journals or if certain statistical measures correlated with each other. We applied a variety of classification methods including principal component analysis decomposition, agglomerative clustering, and multinomial and Bernoulli naïve Bayes classification and found that none of these could readily determine journal identity given extracted statistical features. Logistic regression was useful in determining high correlative potential between statistical features such as false discovery rate criteria and multiple testing corrections methods, but was similarly ineffective at determining correlations between statistical features and specific journals. This meta-analysis highlights that there is a very wide variety of approaches being used in statistical analysis of proteomics data, many of which do not conform to published journal guidelines, and that contrary to implicit assumptions in the field there are no clear correlations between statistical methods and specific journals.

Full Text