Abstract
The presence of missing values (MVs) in label-free quantitative proteomics greatly reduces the completeness of data. Imputation is widely used to handle MVs, and selecting a proper method is critical for the accuracy and reliability of the imputed values. Here we present a comparative study that evaluates the performance of seven popular imputation methods on a large-scale benchmark dataset and an immune cell dataset. Simulated MVs were introduced into the complete part of each dataset at different combinations of MV rate and missing not at random (MNAR) rate. Normalized root mean square error (NRMSE) was used to evaluate the accuracy of protein abundances and intergroup protein ratios after imputation. Detection of true positives (TPs) and the false altered-protein discovery rate (FADR) between groups were also compared using the benchmark dataset. Furthermore, the accuracy of handling real MVs was assessed by comparing enriched pathways and signature genes of cell activation after imputing the immune cell dataset. We observed that the accuracy of imputation is primarily affected by the MNAR rate rather than the MV rate, and that downstream analysis can be strongly affected by the choice of imputation method. A random forest-based imputation method consistently outperformed other popular methods by achieving the lowest NRMSE, a high number of TPs with an average FADR < 5%, and the best detection of relevant pathways and signature genes, highlighting it as the most suitable method for label-free proteomics.
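The simulation and scoring steps described in the abstract can be illustrated with a minimal sketch. It rests on assumptions that the abstract does not state: MNAR values are drawn from the lowest abundances (a left-censoring model), the remaining MVs are removed uniformly at random, and NRMSE is normalized by the variance of the true values at the masked positions. The function names `simulate_missing` and `nrmse` are hypothetical and are not taken from the study.

```python
import math
import random


def simulate_missing(matrix, mv_rate, mnar_rate, seed=0):
    """Return a copy of `matrix` with simulated MVs set to None.

    mv_rate   -- fraction of all values to remove.
    mnar_rate -- fraction of the removed values that are MNAR.
    Assumption: MNAR values are the lowest abundances; the remaining
    removed values are chosen uniformly at random.
    """
    rng = random.Random(seed)
    cells = [(i, j) for i, row in enumerate(matrix) for j in range(len(row))]
    n_missing = round(mv_rate * len(cells))
    n_mnar = round(mnar_rate * n_missing)

    # MNAR part: remove the lowest-abundance cells first.
    by_abundance = sorted(cells, key=lambda c: matrix[c[0]][c[1]])
    mnar_cells = by_abundance[:n_mnar]

    # Random part: sample the rest from the cells that are still observed.
    random_cells = rng.sample(by_abundance[n_mnar:], n_missing - n_mnar)

    masked = [list(row) for row in matrix]
    for i, j in mnar_cells + random_cells:
        masked[i][j] = None
    return masked


def nrmse(true, imputed, masked):
    """NRMSE over the masked cells (assumed variance normalization)."""
    pairs = [
        (true[i][j], imputed[i][j])
        for i, row in enumerate(masked)
        for j, value in enumerate(row)
        if value is None
    ]
    if not pairs:
        raise ValueError("no masked cells to score")
    truth = [t for t, _ in pairs]
    mean_truth = sum(truth) / len(truth)
    variance = sum((t - mean_truth) ** 2 for t in truth) / len(truth)
    if variance == 0:
        raise ValueError("true values at masked cells have zero variance")
    mse = sum((t - x) ** 2 for t, x in pairs) / len(pairs)
    return math.sqrt(mse / variance)
```

Any imputation method can be scored this way: mask the complete matrix with `simulate_missing`, impute the masked copy, and pass the original, the imputed matrix, and the mask to `nrmse`. Sweeping `mv_rate` and `mnar_rate` over a grid gives the kind of comparison across MV and MNAR rates that the abstract describes.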
Highlights
The presence of missing values (MVs) in label-free quantitative proteomics greatly reduces the completeness of data
Key applications of label-free proteomics include the discovery of biomarkers and new drug targets, but a major issue is that the power of statistical inference and downstream functional analysis is greatly impacted by the presence of missing values (MVs) in the protein abundance data
Our results revealed that the random forest (RF) and local least squares (LLS) imputation methods consistently performed better than the other methods, and RF slightly outperformed LLS in protein ratio estimation and detection of differentially expressed (DE) proteins
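The detection metrics mentioned above can be sketched as a simple set comparison against a benchmark with known altered proteins. The definition used here is an assumption, since the text does not spell it out: FADR is taken as the fraction of proteins called altered that are not truly altered. The function name `tp_and_fadr` and the protein labels are hypothetical.

```python
def tp_and_fadr(called_altered, truly_altered):
    """Count true positives and compute FADR for one imputation result.

    called_altered -- proteins reported as altered after imputation.
    truly_altered  -- proteins known to be altered in the benchmark.
    Assumption: FADR = false positives / all proteins called altered.
    """
    called = set(called_altered)
    truth = set(truly_altered)
    true_positives = len(called & truth)
    false_positives = len(called - truth)
    fadr = false_positives / len(called) if called else 0.0
    return true_positives, fadr


# Hypothetical example: 4 proteins called altered, 3 of them correct.
tp, fadr = tp_and_fadr(["A", "B", "C", "D"], ["A", "B", "C", "X"])
print(tp, fadr)  # 3 0.25
```

Under this definition, a method meets the FADR < 5% level quoted in the abstract when fewer than 1 in 20 of its called proteins are false.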
Summary
The presence of missing values (MVs) in label-free quantitative proteomics greatly reduces the completeness of data. The accuracy of handling real MVs was assessed by comparing enriched pathways and signature genes of cell activation after imputing the immune cell dataset. A random forest-based imputation method consistently outperformed other popular methods by achieving the lowest NRMSE, a high number of TPs with an average FADR < 5%, and the best detection of relevant pathways and signature genes, highlighting it as the most suitable method for label-free proteomics. Key applications of label-free proteomics include the discovery of biomarkers and new drug targets, but a major issue is that the power of statistical inference and downstream functional analysis is greatly impacted by the presence of MVs in the protein abundance data. Global structure methods have been introduced to proteomics because they can handle mixed types of MVs (refs. 3, 5).