The subcellular localization of a protein is important for its function and its interactions with other molecules, and mislocalization is linked to numerous diseases. While atlas-scale efforts have profiled protein localization across various cell lines, existing datasets contain only a limited set of protein and cell line pairs and do not cover all human proteins. We present a method that uses both protein sequences and cellular landmark images to perform Predictions of Unseen Proteins' Subcellular localization (PUPS), which can generalize to both proteins and cell lines not used for model training. PUPS combines a protein language model and an image inpainting model to utilize both protein sequence and cellular images for protein localization prediction. The protein sequence input enables generalization to unseen proteins, and the cellular image input enables cell-type-specific prediction that captures single-cell variability. PUPS' ability to generalize to unseen proteins and cell lines enables us to assess the variability in protein localization across cell lines as well as across single cells within a cell line, and to identify the biological processes associated with the proteins that have variable localization. Experimental validation shows that PUPS can be used to predict protein localization in newly performed experiments outside of the Human Protein Atlas used for training. Collectively, PUPS utilizes both protein sequences and cellular images to predict protein localization in unseen proteins and cell lines with the ability to capture single-cell variability.