This study aimed to develop a deep convolutional neural network (CNN) that predicts postoperative visual outcomes after epiretinal membrane (ERM) surgery from preoperative optical coherence tomography (OCT) images and clinical parameters, with the goal of refining surgical decision-making. A total of 529 patients with idiopathic ERM who underwent standard vitrectomy with ERM peeling, performed by two surgeons between January 1, 2014, and June 1, 2020, were enrolled. The newly developed Heterogeneous Data Fusion Net (HDF-Net) was used to predict the postoperative visual acuity (VA) outcome 12 months after surgery, defined as a binary label (improvement of ≥ 2 lines on the Snellen chart, or not), from preoperative cross-sectional OCT images and clinical factors, including age, sex, and preoperative VA. The accuracy, sensitivity, specificity, and area under the receiver operating characteristic curve (AUC) of the CNN model were evaluated. The model achieved an overall accuracy for visual outcome prediction of 88.68% (95% CI, 79.0%-95.7%), with an AUC of 97.8% (95% CI, 86.8%-98.0%), sensitivity of 87.0% (95% CI, 67.9%-95.5%), specificity of 92.9% (95% CI, 77.4%-98.0%), precision of 0.909, recall of 0.870, and F1 score of 0.889. The heatmaps identified the critical areas for prediction as the ellipsoid zone of the photoreceptors and the superficial retina, which was subjected to tangential traction by the proliferative membrane. The novel HDF-Net demonstrated high accuracy in the automated prediction of visual outcomes by weighing and combining multiple preoperative inputs, including OCT images and clinical parameters. This approach may help establish personalized therapeutic strategies for ERM management.
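
For readers less familiar with the reported metrics, the following minimal sketch shows how sensitivity (equivalently, recall), specificity, precision, accuracy, and F1 score are derived from a binary confusion matrix, taking "improvement of ≥ 2 lines" as the positive class. The names `ConfusionCounts` and `binary_metrics` are hypothetical, and the counts in the example are illustrative only; they are not the study's test-set data and are not intended to reproduce its results.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion matrix; positive class = VA improvement of >= 2 lines."""
    tp: int  # improved, predicted improved
    fn: int  # improved, predicted not improved
    tn: int  # not improved, predicted not improved
    fp: int  # not improved, predicted improved


def binary_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Standard definitions; assumes every denominator is non-zero."""
    sensitivity = c.tp / (c.tp + c.fn)   # true-positive rate, same as recall
    specificity = c.tn / (c.tn + c.fp)   # true-negative rate
    precision = c.tp / (c.tp + c.fp)     # positive predictive value
    accuracy = (c.tp + c.tn) / (c.tp + c.fn + c.tn + c.fp)
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "sensitivity": sensitivity,
        "recall": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


if __name__ == "__main__":
    # Illustrative counts only (NOT taken from the study).
    example = ConfusionCounts(tp=40, fn=10, tn=45, fp=5)
    for name, value in binary_metrics(example).items():
        print(f"{name}: {value:.3f}")
```

Note that sensitivity and recall are the same quantity under two names, which is why the two values reported above coincide (87.0% and 0.870). The AUC is not computable from a single confusion matrix, since it depends on the model's scores across all decision thresholds.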