Mass-spectrometry-based proteomics frequently uses label-free quantification strategies because of their cost-effectiveness, methodological simplicity, and ability to identify large numbers of proteins in a single analytical run. Despite these advantages, missing values (MV), which can affect up to 50% of the data matrix, pose a significant challenge by reducing the accuracy, reproducibility, and interpretability of the results. Consequently, effective handling of missing values is crucial for reliable quantitative analysis in proteomic studies. This study systematically evaluated the performance of selected imputation methods for addressing missing values in proteomic datasets. Two protein identification algorithms, FragPipe and MaxQuant, were used to generate the datasets, enabling an assessment of their influence on imputation efficacy. Ten imputation methods representing three methodological categories were analyzed: single-value (LOD, ND, SampMin), local-similarity (kNN, LLS, RF), and global-similarity (LSA, BPCA, PPCA, SVD) approaches. The study also investigated the impact of data logarithmization on imputation performance. The evaluation was conducted in two stages. First, performance metrics, including the normalized root mean square error (NRMSE) and the area under the receiver operating characteristic (ROC) curve (AUC), were applied to datasets with artificially introduced missing values. These datasets were designed to mimic varying MV rates (10%, 25%, 50%) and proportions of values missing not at random (MNAR) (0%, 20%, 40%, 80%, 100%). This step made it possible to assess how data characteristics affect the relative effectiveness of the imputation methods. Second, the imputation strategies were applied to real proteomic datasets containing naturally occurring missing values, focusing on the true-positive (TP) classification of proteins to evaluate their practical utility.
The findings highlight that local-similarity-based methods, particularly random forest (RF) and local least-squares (LLS), consistently exhibit robust performance across varying MV scenarios. Furthermore, data logarithmization significantly enhances the effectiveness of global-similarity methods, suggesting it as a beneficial preprocessing step prior to imputation. The study underscores the importance of tailoring imputation strategies to the specific characteristics of the data to maximize the reliability of label-free quantitative proteomics. Interestingly, while the choice of protein identification algorithm (FragPipe vs. MaxQuant) had minimal influence on the overall imputation error, differences in the number of proteins classified as true positives revealed more nuanced effects, emphasizing the interplay between imputation strategies and downstream analysis outcomes. These findings provide a comprehensive framework for improving the accuracy and reproducibility of proteomic analyses through an informed selection of imputation approaches.
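The first evaluation stage described above (masking known values at a chosen MV rate and MNAR proportion, imputing, and scoring with NRMSE) can be illustrated with a minimal sketch. It is not the study's implementation, and several points are assumptions made only for illustration: the function names are hypothetical; MNAR is modelled as simple left-censoring of the lowest intensities, with the remaining missing values drawn completely at random; SampMin is taken to mean the per-sample minimum of observed values; NRMSE is normalized by the variance of the true values at the masked positions; and the input matrix is synthetic.

```python
import numpy as np

def introduce_missing(X, mv_rate, mnar_frac, rng):
    """Return a copy of X with NaNs at a total rate of `mv_rate`.

    Assumption: a fraction `mnar_frac` of the missing entries are the
    lowest values in the matrix (left-censoring, MNAR); the rest are
    removed uniformly at random from the remaining entries.
    """
    flat = X.astype(float).ravel().copy()
    n_total = int(round(mv_rate * flat.size))
    n_mnar = int(round(mnar_frac * n_total))
    # MNAR part: drop the n_mnar lowest intensities.
    flat[np.argsort(flat)[:n_mnar]] = np.nan
    # Random part: drop entries uniformly from what is still observed.
    observed = np.flatnonzero(~np.isnan(flat))
    flat[rng.choice(observed, size=n_total - n_mnar, replace=False)] = np.nan
    return flat.reshape(X.shape)

def impute_sample_min(X):
    """Replace NaNs with the minimum observed value of each sample (column)."""
    col_min = np.nanmin(X, axis=0)
    return np.where(np.isnan(X), col_min, X)

def nrmse(X_true, X_imputed, mask):
    """NRMSE over masked entries, normalized by the variance of the true values."""
    diff = X_imputed[mask] - X_true[mask]
    return np.sqrt(np.mean(diff ** 2) / np.var(X_true[mask]))

# Synthetic stand-in for a log2 intensity matrix (proteins x samples).
rng = np.random.default_rng(0)
X = rng.normal(loc=25.0, scale=2.0, size=(200, 6))

X_mv = introduce_missing(X, mv_rate=0.25, mnar_frac=0.40, rng=rng)
mask = np.isnan(X_mv)
X_imp = impute_sample_min(X_mv)
score = nrmse(X, X_imp, mask)
print(f"masked entries: {mask.sum()}, NRMSE: {score:.3f}")
```

Repeating the last block over a grid of `mv_rate` and `mnar_frac` values, and swapping `impute_sample_min` for other imputers, would give the kind of comparison the abstract describes, although the study's actual masking scheme and NRMSE normalization may differ from these assumptions.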