Time series extrinsic regression (TSER) aims to predict a numeric value from an entire time series. The key to solving the TSER problem is to extract and exploit the most representative and informative features of the raw time series. Building a regression model that focuses on the information best suited to the extrinsic regression task raises two major issues: how to quantify the contributions of the features extracted from the raw time series, and how to focus the regression model's attention on the critical features so as to improve its regression performance. In this article, a multitask learning framework called the temporal-frequency auxiliary task (TFAT) is designed to solve these problems. To exploit information from both the time and frequency domains, we decompose the raw time series into multiscale subseries at various frequencies via a deep wavelet decomposition network. To address the first issue, a transformer encoder with a multihead self-attention mechanism is integrated into the TFAT framework to quantify the contribution of the temporal-frequency information. To address the second issue, an auxiliary task trained in a self-supervised manner is proposed to reconstruct the critical temporal-frequency features, thereby focusing the regression model's attention on the information essential for TSER performance. We estimate three kinds of attention distribution over the temporal-frequency features to perform the auxiliary task. To evaluate our method under various application scenarios, experiments are carried out on 12 TSER datasets, and ablation studies examine the effectiveness of each component of our method.
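The two core ingredients described above can be illustrated in miniature. The sketch below is an assumption-laden NumPy toy, not the authors' implementation: it substitutes a fixed Haar wavelet for the paper's learned deep wavelet decomposition network, and a single self-attention head with random projections (`w_q`, `w_k`, both hypothetical) for the transformer encoder, merely to show how multiscale subseries can be produced and how averaged attention can serve as a per-subseries contribution score.

```python
import numpy as np

def haar_step(x):
    """One level of Haar wavelet decomposition: returns an
    (approximation, detail) pair, each half the input length."""
    x = x[: len(x) // 2 * 2]               # truncate to even length
    pairs = x.reshape(-1, 2)
    approx = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2)
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)
    return approx, detail

def multiscale_decompose(x, levels=3):
    """Decompose a raw series into multiscale subseries: the detail
    coefficients at each level plus the coarsest approximation."""
    subseries, approx = [], np.asarray(x, dtype=float)
    for _ in range(levels):
        approx, detail = haar_step(approx)
        subseries.append(detail)
    subseries.append(approx)
    return subseries

def attention_weights(features, w_q, w_k):
    """Scaled dot-product self-attention over per-subseries feature
    vectors; the column-averaged attention each subseries receives is
    a rough proxy for its contribution to the regression."""
    q, k = features @ w_q, features @ w_k
    scores = q @ k.T / np.sqrt(k.shape[1])
    scores -= scores.max(axis=1, keepdims=True)   # softmax stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=1, keepdims=True)
    return attn.mean(axis=0)                      # one weight per subseries
```

For example, a length-16 series decomposed with `levels=3` yields subseries of lengths 8, 4, 2, and 2; summarizing each with simple statistics and passing them through `attention_weights` gives a nonnegative contribution vector that sums to one. In the actual TFAT framework the decomposition filters and attention projections are learned end to end rather than fixed as here.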