Haptic force feedback is an important channel through which humans perceive the surrounding environment. Estimating contact force in real time and providing appropriate feedback has significant research value for robot-assisted minimally invasive surgery, interactive tactile robots, and other application fields. However, most existing noncontact visual force estimation methods rely on traditional machine learning or on 2D/3D CNNs combined with LSTMs. Such methods struggle to fully extract the contextual spatiotemporal semantic interaction information across consecutive image frames, which limits their performance. To this end, this paper proposes a force estimation model based on a time-sensitive dual-resolution learning network to achieve accurate noncontact visual force prediction. First, we normalize the consecutive frames of the robot operation video captured by the camera and apply hybrid data augmentation to improve data diversity; second, we construct a deep semantic interaction model based on the time-sensitive dual-resolution learning network, which automatically extracts deep spatiotemporal semantic interaction information from consecutive multiframe images; finally, we build a simplified prediction model for efficient interaction force estimation. Results on a large-scale robot hand interaction dataset show that our method estimates robot hand interaction force more accurately and faster than existing approaches: the average prediction MSE reaches 0.0009 N, R² reaches 0.9833, and the average inference time for a single image is 6.5532 ms. In addition, our method generalizes well across different environments and parameter settings.
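The abstract does not specify the architecture in detail; purely as an illustration, the following is a minimal PyTorch sketch of one plausible reading of a dual-resolution design: a densely sampled (time-sensitive) 3D-CNN pathway and a sparsely sampled pathway process the same normalized clip, their features are fused, and a lightweight regression head outputs a scalar force. All module names, channel widths, and temporal strides here are hypothetical choices, not the paper's actual model.

```python
import torch
import torch.nn as nn

class Pathway3D(nn.Module):
    """A small 3D-CNN pathway: Conv3d -> BN -> ReLU blocks with pooling."""
    def __init__(self, in_ch=3, width=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, width, kernel_size=(3, 7, 7),
                      stride=(1, 2, 2), padding=(1, 3, 3)),
            nn.BatchNorm3d(width),
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=(1, 2, 2)),
            nn.Conv3d(width, width * 2, kernel_size=3, padding=1),
            nn.BatchNorm3d(width * 2),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),  # global spatiotemporal pooling
        )
        self.out_dim = width * 2

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.net(x).flatten(1)  # (B, out_dim)

class DualResolutionForceNet(nn.Module):
    """Two pathways at different temporal resolutions, fused for force regression."""
    def __init__(self, fast_stride=1, slow_stride=4):
        super().__init__()
        self.fast_stride = fast_stride   # dense frame sampling: time-sensitive pathway
        self.slow_stride = slow_stride   # sparse frame sampling: semantic pathway
        self.fast = Pathway3D(width=16)  # lighter channels, more frames
        self.slow = Pathway3D(width=32)  # heavier channels, fewer frames
        self.head = nn.Sequential(       # simplified prediction head
            nn.Linear(self.fast.out_dim + self.slow.out_dim, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, 1),            # scalar interaction force
        )

    def forward(self, clip):  # clip: (B, 3, T, H, W), normalized frames
        f = self.fast(clip[:, :, ::self.fast_stride])
        s = self.slow(clip[:, :, ::self.slow_stride])
        return self.head(torch.cat([f, s], dim=1))

if __name__ == "__main__":
    model = DualResolutionForceNet()
    clip = torch.randn(2, 3, 16, 112, 112)  # batch of two 16-frame clips
    force = model(clip)                      # (2, 1) predicted force
    loss = nn.functional.mse_loss(force, torch.zeros_like(force))
    print(force.shape, loss.item())
```

Under this sketch, training would minimize the MSE between predicted and measured forces, matching the MSE metric reported above; the two temporal strides are what make the fused features sensitive to both fast contact dynamics and slower scene context.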