Extracting specific attributes of a face within an image, such as emotion, age, or head pose, has numerous applications. As one of the most widely used classes of vision-based attribute extraction models, HPE (Head Pose Estimation) models have been extensively explored. Despite the success of these models, the pre-processing step of cropping the region of interest from the image before it is fed into the network remains a challenge. Moreover, a significant portion of the existing models are problem-specific and were developed solely for HPE. In response to the wide application of HPE models and the limitations of existing techniques, we developed a multi-purpose, multi-task model that performs face detection and pose estimation (along both the yaw and pitch axes) in parallel. This model is based on the Mask-RCNN object detection model, which computes a collection of mid-level shared features and feeds them to independent neural networks for the detection of faces and the estimation of poses. We evaluated the proposed model using two publicly available datasets, <i>Prima</i> and <i>BIWI</i>, and obtained MAEs (Mean Absolute Errors) of 8.0 ± 8.6 and 8.2 ± 8.1 for yaw and pitch estimation, respectively, on <i>Prima</i>, and 6.2 ± 4.7 and 6.6 ± 4.9 on <i>BIWI</i>. The generalization capability of the model and its cross-domain effectiveness were assessed on the publicly available <i>UTKFace</i> dataset for face detection and age estimation, resulting in an MAE of 5.3 ± 3.2. Across the domains on which it was tested, the proposed model compares favorably with state-of-the-art models, based on their published results. We provide the source code of our model for public use at: <uri>https://github.com/kahroba2000/MTL_MRCNN</uri>.
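
The results above are reported as a mean absolute error followed by a spread (for example, 8.0 ± 8.6). As a minimal sketch of how such a figure can be computed, the snippet below takes the mean and the standard deviation of the per-sample absolute errors. It assumes that the ± term is the population standard deviation of the absolute errors, which the abstract does not state. The function name `mae_with_spread` and the sample angles are hypothetical and are not taken from the released code.

```python
from statistics import mean, pstdev


def mae_with_spread(predicted, actual):
    """Return (mean, population std) of the absolute errors.

    Assumption: the "±" term is the population standard deviation
    of the per-sample absolute errors.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        raise ValueError("at least one sample is required")
    abs_errors = [abs(p - a) for p, a in zip(predicted, actual)]
    return mean(abs_errors), pstdev(abs_errors)


# Hypothetical yaw angles, for illustration only (not from Prima or BIWI).
yaw_true = [0.0, 15.0, -30.0, 45.0]
yaw_pred = [2.0, 11.0, -36.0, 45.0]

yaw_mae, yaw_std = mae_with_spread(yaw_pred, yaw_true)
print(f"yaw MAE: {yaw_mae:.1f} ± {yaw_std:.1f}")
```

A spread comparable in size to the mean, as in the <i>Prima</i> results, suggests that the absolute errors are unevenly distributed, with a minority of samples contributing large errors.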