Modeling and forecasting the evolution of battery systems involve complex interactions across physical, chemical, and electrochemical processes, influenced by diverse usage demands and dynamic operational patterns. In this study, we developed a predictive pre-trained Transformer (PPT) model with 1,871,114 parameters that improves the identification of both short-term and long-term patterns in time-series data. This is achieved by integrating convolutional layers with probabilistic sparse self-attention mechanisms, which together improve the accuracy and efficiency of battery health diagnosis. Moreover, the customized hybrid-model fusion supports parallel computing and employs transfer learning, reducing computational cost while improving scalability and adaptability. This allows precise real-time health estimation across various battery cycles. We validated the method on a public dataset of 203 commercial lithium iron phosphate (LFP)/graphite batteries charged at rates ranging from 1C to 8C. Using only partial charge data, from 80 % state of charge to the maximum charging voltage (3.6 V for LFP batteries, 4.2 V for ternary batteries), and without complex feature engineering, the model achieved a root mean square error (RMSE), weighted mean absolute percentage error (WMAPE), and mean absolute error (MAE) all below 0.3 %, with an R² of 98.9 %. Generalization was further demonstrated across 36 different testing protocols, encompassing 23,480 cycles over the entire life cycle, with a total inference time of 9.88 s during the testing phase. Further experiments on 30 nickel cobalt aluminum (NCA) batteries and 36 nickel cobalt manganese (NCM) batteries, covering different battery types and operational scenarios, yielded RMSE, WMAPE, and MAE all below 0.9 %, with R² values of 94.1 % and 94.4 %, respectively.
These findings highlight the potential of our customized deep transfer neural networks to enhance diagnostic accuracy, accelerate training, and improve generalization in real-time applications.
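
For reference, the four reported error metrics can be computed as in the minimal sketch below. The abstract does not give the formulas, so the sketch assumes the commonly used definitions, in particular WMAPE as the sum of absolute errors divided by the sum of absolute true values. The function name `health_metrics` and the sample state-of-health values are hypothetical and are not taken from the study.

```python
import numpy as np


def health_metrics(y_true, y_pred):
    """Return RMSE, MAE, WMAPE and R^2 as fractions (multiply by 100 for %).

    Assumed definitions (not stated in the abstract):
      RMSE  = sqrt(mean((y - y_hat)^2))
      MAE   = mean(|y - y_hat|)
      WMAPE = sum(|y - y_hat|) / sum(|y|)
      R^2   = 1 - SS_res / SS_tot
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred

    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    wmape = float(np.sum(np.abs(err)) / np.sum(np.abs(y_true)))

    ss_res = float(np.sum(err ** 2))                        # residual sum of squares
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))   # total sum of squares
    r2 = 1.0 - ss_res / ss_tot

    return {"rmse": rmse, "mae": mae, "wmape": wmape, "r2": r2}


# Hypothetical state-of-health values, for illustration only.
soh_true = [1.00, 0.98, 0.95, 0.90]
soh_pred = [0.99, 0.98, 0.96, 0.91]
metrics = health_metrics(soh_true, soh_pred)
print({name: round(value, 4) for name, value in metrics.items()})
```

Under these definitions, RMSE and MAE carry the units of the target (a fraction of nominal capacity when the target is state of health), whereas WMAPE and R² are dimensionless. Whether the study normalizes the metrics in exactly this way is an assumption.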