Missing data pose significant challenges for representation learning on observational data: incompleteness can degrade the generative performance of disentangled representation learning. Conventional imputation solutions, such as regression imputation or multiple imputation, often neglect the underlying causal relationships in the data. To address these issues, the causal disentangled representation learning for missing data (CDRM) framework is proposed. The incomplete data are first given a graph representation, forming a heterogeneous network composed of observations, features, and known feature values. To complete the data, an interaction module consisting of a parallel neighbor interaction layer and an embedding update layer is integrated with the heterogeneous network to predict missing values. To recover the true causal relationships underlying the missing data, edge embeddings are further introduced during message passing over the heterogeneous network, capturing interactions among different features and enriching the observation representations. Furthermore, the causal relationships are incorporated into a variational autoencoder (VAE) through two distinct encoders that learn representations of causally related concepts. In experimental evaluations on diverse datasets, CDRM consistently outperforms the state-of-the-art method CausalVAE in disentangled representation learning, particularly in scenarios with limited labeled data. Notably, CDRM can also generate counterfactual data in the presence of missing data, further enhancing its utility in machine learning applications. The source code and data are available at https://github.com/Causal-Disentangled/CDRM.
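The abstract does not include implementation details; the following is a minimal, hypothetical PyTorch sketch of the two core ideas it describes — message passing with edge embeddings over an observation–feature graph to impute missing values, and a VAE with two encoders separating causally related factors from the rest. All class, parameter, and variable names are illustrative assumptions, not the authors' actual code (see the linked repository for that).

```python
import torch
import torch.nn as nn

class EdgeMessagePassing(nn.Module):
    """One round of message passing on a bipartite observation-feature graph.

    Observed entries contribute edge embeddings; missing entries (mask == 0)
    contribute nothing and are predicted from the aggregated node embeddings.
    """
    def __init__(self, n_features, dim):
        super().__init__()
        self.feat_emb = nn.Embedding(n_features, dim)            # one node per feature
        self.edge_mlp = nn.Sequential(nn.Linear(dim + 1, dim), nn.ReLU())
        self.obs_update = nn.Linear(dim, dim)                    # embedding-update step
        self.decoder = nn.Linear(2 * dim, 1)                     # predicts one entry

    def forward(self, x, mask):
        n_obs, n_feat = x.shape
        f = self.feat_emb.weight                                  # (n_feat, dim)
        # Edge embedding for every (observation, feature, value) triple.
        edges = self.edge_mlp(
            torch.cat([f.expand(n_obs, -1, -1), x.unsqueeze(-1)], dim=-1)
        )                                                         # (n_obs, n_feat, dim)
        # Aggregate only observed edges into each observation node (neighbor interaction).
        obs_emb = torch.relu(self.obs_update((edges * mask.unsqueeze(-1)).sum(1)))
        # Predict every entry from observation and feature embeddings; keep observed values.
        pred = self.decoder(
            torch.cat([obs_emb.unsqueeze(1).expand(-1, n_feat, -1),
                       f.expand(n_obs, -1, -1)], dim=-1)
        ).squeeze(-1)
        return mask * x + (1 - mask) * pred, obs_emb

class DualEncoderVAE(nn.Module):
    """VAE with two encoders: one for causally related concepts, one for the rest."""
    def __init__(self, in_dim, z_causal, z_other):
        super().__init__()
        self.enc_causal = nn.Linear(in_dim, 2 * z_causal)         # mean and log-variance
        self.enc_other = nn.Linear(in_dim, 2 * z_other)
        self.dec = nn.Linear(z_causal + z_other, in_dim)

    def forward(self, h):
        mu_c, logvar_c = self.enc_causal(h).chunk(2, dim=-1)
        mu_o, logvar_o = self.enc_other(h).chunk(2, dim=-1)
        # Reparameterization trick for each latent block.
        z_c = mu_c + torch.randn_like(mu_c) * (0.5 * logvar_c).exp()
        z_o = mu_o + torch.randn_like(mu_o) * (0.5 * logvar_o).exp()
        return self.dec(torch.cat([z_c, z_o], dim=-1)), (mu_c, logvar_c, mu_o, logvar_o)

# Toy usage: 8 observations, 5 features, roughly 30% of entries missing.
x = torch.randn(8, 5)
mask = (torch.rand(8, 5) > 0.3).float()
imputer = EdgeMessagePassing(n_features=5, dim=16)
x_complete, obs_emb = imputer(x * mask, mask)
recon, stats = DualEncoderVAE(in_dim=16, z_causal=4, z_other=4)(obs_emb)
```

The sketch omits the losses (reconstruction, KL, and any causal-structure terms) and the actual causal graph over the latent concepts, which in the full framework would constrain how the two latent blocks interact.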