Numerous deep learning methods have been developed to tackle the challenges of food image recognition, including convolutional neural networks, deep feature extraction, and deep feature fusion. This research proposes a new architecture, ASTFF-Net, that uses deep feature fusion to address several challenges in food recognition: visual similarity between categories, multi-object scenes, varying lighting conditions, camera position, background noise, and blurred images. ASTFF-Net is a robust, adaptive spatial–temporal feature fusion network designed to address these challenges effectively. The ASTFF-Net architecture consists of three networks. In the spatial feature extraction network, the ResNet50 architecture extracts robust spatial features, and a reduction operation minimizes the parameter size; the spatial features are then passed through a 1D convolution (Conv1D) to fit them to the recurrent layers. In the temporal feature extraction network, the spatial features are fed to a long short-term memory (LSTM) network, allowing the model to learn from long sequence patterns. In the adaptive feature fusion network, the robust spatial and temporal features are fused and passed to a Conv1D layer, followed by a softmax function. The ASTFF-Net architecture is also designed to reduce the number of network parameters and to prevent overfitting. Experimental results on four benchmark food image datasets (Food11, UEC Food-100, UEC Food-256, and ETH Food-101) demonstrate that the proposed ASTFF-Nets, particularly ASTFF-NetB3, are more competitive than other existing methods.
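The three-network pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the channel sizes, the form of the reduction operation, and the fusion layout are assumptions, and a tiny convolutional stack stands in for the ResNet50 backbone to keep the example self-contained.

```python
import torch
import torch.nn as nn

class ASTFFNetSketch(nn.Module):
    """Hypothetical sketch of the ASTFF-Net pipeline from the abstract.
    Channel sizes, the reduction operation, and the fusion layout are
    assumptions; a small conv stack stands in for the ResNet50 backbone."""

    def __init__(self, num_classes=11, backbone_dim=64, reduced_dim=32):
        super().__init__()
        # Spatial feature extraction: stand-in backbone (the paper uses ResNet50)
        self.backbone = nn.Sequential(
            nn.Conv2d(3, backbone_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(backbone_dim, backbone_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(7),  # 7x7 spatial map, as in ResNet50's last stage
        )
        # Reduction operation: 1x1 conv shrinks channels to cut parameters (assumed form)
        self.reduce = nn.Conv2d(backbone_dim, reduced_dim, 1)
        # Conv1D fits the flattened spatial map into a sequence for the LSTM
        self.conv1d_in = nn.Conv1d(reduced_dim, reduced_dim, 3, padding=1)
        # Temporal feature extraction: LSTM over the 49 spatial positions as a sequence
        self.lstm = nn.LSTM(reduced_dim, reduced_dim, batch_first=True)
        # Adaptive feature fusion: concatenated features -> Conv1D -> softmax classifier
        self.fuse = nn.Conv1d(2 * reduced_dim, reduced_dim, 1)
        self.classifier = nn.Linear(reduced_dim, num_classes)

    def forward(self, x):
        f = self.reduce(self.backbone(x))                   # (B, C, 7, 7)
        seq = self.conv1d_in(f.flatten(2))                  # (B, C, 49) spatial features
        t, _ = self.lstm(seq.transpose(1, 2))               # (B, 49, C) temporal features
        fused = torch.cat([seq, t.transpose(1, 2)], dim=1)  # (B, 2C, 49) fused features
        pooled = self.fuse(fused).mean(dim=2)               # (B, C)
        return self.classifier(pooled).softmax(dim=1)       # class probabilities

model = ASTFFNetSketch(num_classes=11)  # e.g. 11 classes, as in Food11
probs = model(torch.randn(2, 3, 224, 224))
print(tuple(probs.shape))  # (2, 11)
```

Treating the 49 positions of the 7x7 feature map as a sequence is one plausible way to hand CNN features to an LSTM; the paper's exact sequence layout may differ.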