Micrometre level ranging accuracy between satellites on-orbit relies on the high-precision calibration of the antenna phase center (APC), which is accomplished through properly designed calibration maneuvers batch estimation algorithms currently. However, the unmodeled perturbations of the space dynamic and sensor-induced uncertainty complicated the situation in reality; ranging accuracy especially deteriorated outside the antenna main-lobe when maneuvers performed. This paper proposes an on-orbit APC calibration method that uses a reinforcement learning (RL) process, aiming to provide the high accuracy ranging datum for onboard instruments with micrometre level. The RL process used here is an improved Temporal Difference advantage actor critic algorithm (TDAAC), which mainly focuses on two neural networks (NN) for critic and actor function. The output of the TDAAC algorithm will autonomously balance the APC calibration maneuvers amplitude and APC-observed sensitivity with an object of maximal APC estimation accuracy. The RL-based APC calibration method proposed here is fully tested in software and on-ground experiments, with an APC calibration accuracy of less than 2 mrad, and the on-orbit maneuver data from 11–12 April 2022, which achieved 1–1.5 mrad calibration accuracy after RL training. The proposed RL-based APC algorithm may extend to prove mass calibration scenes with actions feedback to attitude determination and control system (ADCS), showing flexibility of spacecraft payload applications in the future.