Transformers deep learning models for missing data imputation: an application of the ReMasker model on a psychometric scale

Monica Casella, Nicola Milano, Pasquale Dolce, Davide Marocco

doi:10.3389/fpsyg.2024.1449272

Abstract

Introduction: Missing data in psychometric research presents a substantial challenge, impacting the reliability and validity of study outcomes. Various factors contribute to this issue, including participant non-response, dropout, or technical errors during data collection. Traditional methods commonly used to handle missing data, such as mean imputation or regression, rely on assumptions that may not hold for psychological data and can lead to distorted results.

Methods: This study evaluates the effectiveness of transformer-based deep learning for missing data imputation, comparing ReMasker, a masking autoencoding transformer model, with conventional imputation techniques (mean and median imputation, Expectation–Maximization algorithm) and machine learning approaches (K-nearest neighbors, MissForest, and an Artificial Neural Network). A psychometric dataset from the COVID distress repository was used, with imputation performance assessed through the Root Mean Squared Error (RMSE) between the original and imputed data matrices.

Results: Machine learning techniques, particularly ReMasker, achieve lower reconstruction error than conventional imputation techniques across all tested scenarios.

Discussion: This finding underscores the potential of transformer-based models to provide robust imputation in psychometric research, enhancing data integrity and generalizability.
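The RMSE-based evaluation named in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the synthetic 5-point item responses, the 20% missing rate, the completely-at-random masking, and the helper names (`rmse`, `mask_completely_at_random`, `mean_impute`) are all assumptions made for illustration, and mean imputation stands in for any of the compared methods.

```python
import numpy as np


def rmse(original, imputed):
    """Root Mean Squared Error between two equally shaped data matrices."""
    original = np.asarray(original, dtype=float)
    imputed = np.asarray(imputed, dtype=float)
    return float(np.sqrt(np.mean((original - imputed) ** 2)))


def mask_completely_at_random(data, missing_rate, rng):
    """Hide a random share of entries (assumed mechanism: completely at random)."""
    data = np.asarray(data, dtype=float)
    mask = rng.random(data.shape) < missing_rate  # True marks a hidden entry
    masked = data.copy()
    masked[mask] = np.nan
    return masked, mask


def mean_impute(masked):
    """Replace each missing entry with the mean of its column's observed values."""
    col_means = np.nanmean(masked, axis=0)
    return np.where(np.isnan(masked), col_means, masked)


# Synthetic stand-in for a psychometric scale: 200 respondents, 10 items,
# responses on a 1-5 scale (hypothetical data, not the study's dataset).
rng = np.random.default_rng(0)
complete = rng.integers(1, 6, size=(200, 10)).astype(float)

masked, mask = mask_completely_at_random(complete, missing_rate=0.2, rng=rng)
imputed = mean_impute(masked)

# Error between the original and imputed matrices; observed entries are
# unchanged, so only the imputed cells contribute to the squared error.
score = rmse(complete, imputed)
print(f"RMSE of mean imputation: {score:.3f}")
```

Swapping `mean_impute` for another imputer while keeping the same mask and the same `rmse` call is one way to compare methods on equal footing, with lower values indicating better reconstruction.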

Full Text