We develop a new model for tensor completion that incorporates noisy side information available on the rows and columns of a 3-dimensional tensor. The method learns a low-rank representation of the data along with regression coefficients for the observed noisy features. Given this model, we propose an efficient alternating minimization algorithm that finds high-quality solutions and scales to large data sets. Through extensive computational experiments, we demonstrate that this method leads to significant gains in out-of-sample accuracy when filling in missing values in both simulated and real-world data. We consider the problem of imputing drug response in three large-scale anti-cancer drug screening data sets: the Genomics of Drug Sensitivity in Cancer (GDSC), the Cancer Cell Line Encyclopedia (CCLE), and the Genentech Cell Line Screening Initiative (GCSI). On imputation tasks with 20% to 80% missing data, we show that the proposed method, TensorGenomic, matches or outperforms state-of-the-art methods, including the original tensor model and a multilevel mixed effects model. With 80% missing data, TensorGenomic improves the R^2 from 0.404 to 0.552 in the GDSC data set, from 0.407 to 0.524 in the CCLE data set, and from 0.331 to 0.453 in the GCSI data set, compared with the tensor model that does not take genomic side information into account.
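To make the reported metric concrete, the following is a minimal sketch of how an out-of-sample R^2 on held-out tensor entries could be computed. It is not the paper's evaluation code: the abstract does not state how entries are hidden, so hiding them uniformly at random is an assumption, and the function names `holdout_mask` and `out_of_sample_r2`, the synthetic low-rank tensor, and the mean-fill baseline are all hypothetical stand-ins for illustration.

```python
import numpy as np


def holdout_mask(shape, missing_frac, rng):
    """Boolean mask that is True where an entry is hidden from the model.

    Assumption: entries are hidden independently and uniformly at random.
    """
    return rng.random(shape) < missing_frac


def out_of_sample_r2(truth, pred, hidden):
    """R^2 computed only over the hidden (held-out) entries."""
    y = truth[hidden]
    y_hat = pred[hidden]
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - y.mean()) ** 2)    # total sum of squares
    return 1.0 - ss_res / ss_tot


rng = np.random.default_rng(0)

# Hypothetical rank-3 tensor standing in for a drug-response tensor.
U = rng.normal(size=(30, 3))
V = rng.normal(size=(20, 3))
W = rng.normal(size=(5, 3))
truth = np.einsum("ir,jr,kr->ijk", U, V, W)

# Hide roughly 80% of the entries, the hardest setting in the abstract.
hidden = holdout_mask(truth.shape, 0.8, rng)

# Naive baseline: fill every hidden entry with the mean of the observed ones.
baseline = np.full(truth.shape, truth[~hidden].mean())

r2_baseline = out_of_sample_r2(truth, baseline, hidden)
r2_perfect = out_of_sample_r2(truth, truth, hidden)
```

A constant fill can never score above zero on this metric, while a perfect reconstruction scores exactly one, which gives a sense of scale for the reported values between 0.331 and 0.552.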