Wheat yield and grain protein content (GPC) are two main optimization targets in breeding and cultivation. Remote sensing enables nondestructive, early prediction of both yield and GPC. However, it remains unclear whether yield and GPC can be predicted simultaneously by a single model, how accurate such a model would be, and which factors influence its accuracy. In this study, we systematically compared deep learning models in terms of data fusion, time-series feature extraction, and multitask learning. The results showed that time-series data fusion significantly improved yield and GPC prediction accuracy, with R² values of 0.817 and 0.809, respectively. Multitask learning predicted yield and GPC simultaneously with accuracy comparable to that of the single-task model. We further proposed a two-to-two model that combines data fusion (two data sources as input) and multitask learning (two outputs), and compared different feature extraction layers: a recurrent neural network (RNN), long short-term memory (LSTM), a convolutional neural network (CNN), and an attention module. The two-to-two model with the attention module achieved the best prediction accuracy for yield (R² = 0.833) and GPC (R² = 0.846). The temporal distribution of feature importance was visualized from the attention feature values. Although the temporal patterns of structural traits and spectral traits were inconsistent, both kinds of traits were more important overall at the postanthesis stage than at the preanthesis stage. This study provides new insights into the simultaneous prediction of yield and GPC using deep learning on time-series proximal sensing data, which may contribute to accurate and efficient prediction of agricultural production.
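The two-to-two idea (two time-series inputs, attention over time, two outputs) can be sketched as a single forward pass. The following is a minimal illustration in plain Python, not the authors' implementation. The function names, the single query vector per source, the concatenation-based fusion, the linear output heads, and the random untrained weights are all assumptions made for the example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention_pool(series, query):
    """Collapse T time steps into one vector using attention weights.

    series: list of T feature vectors (one per observation date).
    query:  hypothetical learnable vector that scores each time step.
    """
    weights = softmax([dot(step, query) for step in series])
    dim = len(series[0])
    pooled = [sum(w * step[i] for w, step in zip(weights, series))
              for i in range(dim)]
    return pooled, weights


def init_params(n_structural, n_spectral, seed=0):
    """Random, untrained parameters (assumption: illustration only)."""
    rng = random.Random(seed)

    def vec(n):
        return [rng.uniform(-0.5, 0.5) for _ in range(n)]

    n_fused = n_structural + n_spectral
    return {
        "q_structural": vec(n_structural),
        "q_spectral": vec(n_spectral),
        "w_yield": vec(n_fused), "b_yield": 0.0,
        "w_gpc": vec(n_fused), "b_gpc": 0.0,
    }


def two_to_two_forward(structural, spectral, params):
    """Two inputs (structural, spectral series) to two outputs (yield, GPC)."""
    s_vec, s_attn = attention_pool(structural, params["q_structural"])
    p_vec, p_attn = attention_pool(spectral, params["q_spectral"])
    fused = s_vec + p_vec  # data fusion: concatenate the pooled features
    yield_pred = dot(fused, params["w_yield"]) + params["b_yield"]
    gpc_pred = dot(fused, params["w_gpc"]) + params["b_gpc"]
    return yield_pred, gpc_pred, s_attn, p_attn
```

In this sketch, the per-date attention weights returned for each source sum to one, so they can be read as a rough temporal importance profile, in the spirit of the attention-based visualization the abstract mentions. A trained model would learn the query vectors and output weights jointly from both targets.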