<h3>Purpose/Objective(s)</h3> In image-guided adaptive high-dose-rate brachytherapy of cervical cancer, treatment planning optimization relies on a human planner to adjust parameters of the optimization problem, such as the weights of the clinical target volume (CTV) and organs at risk (OARs). The resulting plan quality depends critically on these parameters. This manual planning process can yield suboptimal plans because of factors such as planner inexperience and time pressure. This study develops an automated planning process that uses deep reinforcement learning (DRL) to build a virtual planner trained to autonomously make human-like decisions when adjusting planning parameters, with the goal of generating high-quality plans that optimize treatment outcome. <h3>Materials/Methods</h3> The plan quality score was defined as the tumor control probability (TCP) divided by the average normal tissue complication probability (NTCP) of the bladder, rectum, and sigmoid colon. TCP and NTCP were calculated from the combined dose of the brachytherapy and external beam radiotherapy treatments, using dose-response relationships from prior studies. For the virtual planner, we built deep neural networks that observe the CTV D90 and the OARs' D2cc and decide on adjustments to the planning parameters in the optimization engine. We trained these networks via an end-to-end DRL process to learn a strategy that maximizes a reward function of the plan quality score. Experience replay and an ε-greedy algorithm were implemented. Over 24,000 state-action pairs generated from 4 patient cases were used for training, and 2 additional patient cases not included in training were used for testing. Performance was assessed relative to human-generated plans. <h3>Results</h3> Plans generated by the virtual planner surpassed the quality of the corresponding clinical plans 5 of 6 times, attaining 6.50% higher scores on average. 
This improvement is attributable to an average TCP increase of 1.41% and an average NTCP reduction of 5.46%. A substantial proportion of the NTCP improvement stems from reduced rectal toxicity (-7.88% D2cc, -10.17% NTCP), though reduced bladder (-3.26% D2cc, -0.02% NTCP) and sigmoid (-1.45% D2cc, -3.44% NTCP) toxicity was obtained as well. <h3>Conclusion</h3> A DRL-based virtual planner was trained to autonomously determine how to operate the treatment planning optimization engine, generating plans of higher clinical quality than human-generated plans. Our study demonstrates the immense potential of DRL-guided approaches to maximize clinical outcomes in treatment planning.
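As an illustration of the quantities named in Materials/Methods, the following is a minimal Python sketch of the plan quality score (TCP divided by the mean NTCP of the bladder, rectum, and sigmoid colon), ε-greedy action selection, and an experience replay buffer. It is not the study's implementation. The function and class names, the tuple layout of a stored transition, and the use of a plain list of action values are all assumptions made for illustration; the abstract does not specify the network architecture, the action set, the TCP/NTCP models, or the exact form of the reward function.

```python
import random
from collections import deque


def plan_quality_score(tcp, ntcp_bladder, ntcp_rectum, ntcp_sigmoid):
    """TCP divided by the mean NTCP of the three OARs (definition from the abstract).

    How TCP and NTCP are computed from dose is not shown here; the inputs are
    assumed to be probabilities already derived from dose-response models.
    """
    mean_ntcp = (ntcp_bladder + ntcp_rectum + ntcp_sigmoid) / 3.0
    if mean_ntcp <= 0:
        raise ValueError("mean NTCP must be positive")
    return tcp / mean_ntcp


def epsilon_greedy(action_values, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the best-valued action.

    action_values is a hypothetical list of estimated values, one per
    parameter-adjustment action.
    """
    if rng.random() < epsilon:
        return rng.randrange(len(action_values))
    return max(range(len(action_values)), key=lambda a: action_values[a])


class ReplayBuffer:
    """Fixed-capacity store of (state, action, reward, next_state) transitions."""

    def __init__(self, capacity):
        # A deque with maxlen discards the oldest transition when full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size, rng=random):
        # Sample without replacement; never request more than is stored.
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)
```

In a training loop of this kind, the state would hold the observed CTV D90 and OAR D2cc values, the chosen action would adjust a planning parameter, and each resulting transition would be stored in the buffer and later sampled in mini-batches to update the network.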