This paper introduces BGE-Adam, an enhanced variant of the Adam optimizer that integrates three techniques to improve the adaptability, convergence, and robustness of the original algorithm under a variety of training conditions. First, BGE-Adam incorporates a dynamic β parameter adjustment mechanism that uses the rate of gradient variation to adapt the exponential decay rates of the first- and second-moment estimates (β1 and β2). The adjustment is symmetric: the same rule governs both β1 and β2, which preserves the consistency and balance of the algorithm and allows it to adaptively track trends in the gradients. Second, a simple gradient prediction model estimates the direction of future gradients by combining historical gradient information with the current gradient. Third, entropy weighting is integrated into the gradient update step; by injecting a controlled amount of noise, this strategy makes the optimizer more exploratory and improves its adaptability to complex loss surfaces. Experimental results on the classical MNIST and CIFAR10 datasets and on a medical dataset of gastrointestinal disease images show that BGE-Adam improves both convergence and generalization. On the gastrointestinal disease test set, BGE-Adam achieved an accuracy of 69.36%, a clear improvement over the 67.66% attained by the standard Adam algorithm; on the CIFAR10 test set it reached 71.4%, above Adam's 70.65%; and on MNIST it reached 99.34%, surpassing Adam's 99.23%. Overall, BGE-Adam exhibits stronger convergence and robustness than Adam. These results not only demonstrate the effectiveness of combining the three techniques but also offer new perspectives for the future development of deep learning optimization algorithms.
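To make the three mechanisms concrete, the sketch below illustrates one BGE-Adam-style update step. Because the abstract describes the mechanisms only qualitatively, every specific formula here (the β-scaling rule, the linear gradient predictor, the noise magnitude, and parameter names such as `alpha` and `entropy_scale`) is an illustrative assumption, not the paper's exact method.

```python
import numpy as np

def bge_adam_step(param, grad, state, lr=1e-3, eps=1e-8,
                  beta1_base=0.9, beta2_base=0.999,
                  alpha=0.5, entropy_scale=1e-4):
    """One BGE-Adam-style update step (illustrative sketch only)."""
    m, v, prev_grad = state["m"], state["v"], state["prev_grad"]
    t = state["t"] + 1

    # (1) Dynamic beta adjustment (assumed rule): shrink both decay rates
    # when the gradient changes quickly, so recent gradients dominate the
    # moment estimates. The same scaling is applied symmetrically to
    # beta1 and beta2, per the paper's description.
    change = np.linalg.norm(grad - prev_grad) / (np.linalg.norm(prev_grad) + eps)
    scale = 1.0 / (1.0 + change)          # in (0, 1]; assumed form
    beta1 = beta1_base * scale
    beta2 = beta2_base * scale

    # (2) Gradient prediction (assumed predictor): blend the current
    # gradient with a simple linear extrapolation from the previous step.
    predicted = grad + (grad - prev_grad)
    g_hat = alpha * grad + (1 - alpha) * predicted

    # Standard Adam moment estimates with bias correction.
    m = beta1 * m + (1 - beta1) * g_hat
    v = beta2 * v + (1 - beta2) * g_hat ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)

    # (3) Entropy weighting (assumed form): inject a small amount of
    # noise into the update to encourage exploration of the loss surface.
    noise = entropy_scale * np.random.randn(*param.shape)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps) + noise

    state.update(m=m, v=v, prev_grad=grad, t=t)
    return param, state

# Minimal usage: minimize f(w) = ||w||^2 from a random start.
w = np.random.default_rng(0).normal(size=3)
state = {"m": np.zeros(3), "v": np.zeros(3), "prev_grad": np.zeros(3), "t": 0}
for _ in range(200):
    w, state = bge_adam_step(w, 2 * w, state)
```

Note the design intuition this sketch tries to capture: fast-changing gradients shorten the optimizer's effective memory, the predictor anticipates where the gradient is heading, and the noise term trades a little stability for exploration on complex loss surfaces.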