Catheter-based cardiac ablation is a minimally invasive procedure for treating atrial fibrillation (AF). Electrophysiologists perform the procedure under image guidance, and the contact force between the catheter tip and the heart tissue determines the quality of the lesions created. This paper describes a novel multimodal contact-force estimator based on Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs). The estimator takes the shape and the optical flow of the deflectable distal section as two modalities, since individual frames and the motion between them complement each other in capturing long-range context in the catheter video. The angle between the tissue and the catheter tip is used to complement the extracted shape. The data acquisition platform measures the two-degree-of-freedom contact force and records video while the catheter's motion is constrained to the imaging plane. The images are captured with a camera that simulates single-view fluoroscopy for experimental purposes. In this sensor-free approach, features of the image and optical-flow modalities are extracted through transfer learning. Long Short-Term Memory networks (LSTMs) with a memory fusion network (MFN) are implemented to account for time dependency and friction-induced hysteresis. The architecture integrates spatial and temporal networks. Late fusion with concatenation, using LSTMs, transformer decoders, and Gated Recurrent Units (GRUs), is also implemented to verify the feasibility of the proposed network-based approach and its superiority over single-modality networks. The resulting mean absolute error, only 2.84% of the total force magnitude, was obtained with data collected under more realistic conditions than in previous studies. The error is considerably lower than that achieved by the individual modalities and by late fusion with concatenation.
These results emphasize the practicality and relevance of utilizing a multimodal network in real-world scenarios.
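To make the late-fusion baseline concrete, the sketch below shows the general idea of fusing two modalities by concatenating their feature vectors before a regression head. It is a minimal, purely illustrative example: all names, dimensions, and weights are hypothetical, and the paper's actual pipeline uses CNN feature extractors, LSTMs, and a memory fusion network rather than the toy dense layer shown here.

```python
# Illustrative sketch of late fusion by concatenation for two-modality
# contact-force regression. All names, dimensions, and weights are
# hypothetical; the paper's networks (CNN extractors, LSTMs, MFN) are
# far larger and learned from data.
from typing import List


def linear(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """Dense layer: y = W x + b."""
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_i
            for row, b_i in zip(weights, bias)]


def late_fusion_concat(shape_feats: List[float], flow_feats: List[float]) -> List[float]:
    """Late fusion: concatenate the per-modality feature vectors."""
    return shape_feats + flow_feats


# Hypothetical 3-dim features from each modality's extractor
shape_feats = [0.2, -0.1, 0.5]    # from the catheter-shape branch
flow_feats = [0.05, 0.3, -0.2]    # from the optical-flow branch

fused = late_fusion_concat(shape_feats, flow_feats)  # 6-dim fused vector

# Hypothetical regression head mapping fused features to the
# two-degree-of-freedom contact force (e.g. axial, lateral components)
W = [[0.1] * 6, [0.05] * 6]
b = [0.0, 0.0]
force = linear(fused, W, b)
print(force)
```

In this scheme each modality is processed independently and the interaction between modalities is modeled only after concatenation, which is the main limitation the memory fusion network addresses by modeling cross-modal interactions over time.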