Ultra-Reliable and Low-Latency Communication (URLLC) is expected to be one of the most critical features of Beyond fifth-Generation (B5G) cellular networks, with stringent low-latency and high-reliability requirements. The Deep Reinforcement Learning (deep-RL) framework has been applied to optimize Resource Block (RB) allocation and minimize Power Allocation (PA) so as to guarantee high End-to-End (E2E) reliability and low E2E latency under rate constraints. This paper proposes a novel Policy Gradient-based Actor-Critic Learning (PGACL) algorithm that optimizes the policy gradient for rate allocation, jointly solving RB assignment and power minimization while guaranteeing a feasible URLLC schedule. The purpose of the PGACL algorithm is to provide a good policy with a faster convergence rate and a low computational cost, owing to a reduced action space for every user. URLLC systems must remain highly reliable and account for extreme network conditions. Therefore, we propose a refiner Generative Adversarial Network (GAN) that exposes the deep-RL agent to a sufficient number of extreme events by generating highly reliable synthetic data, similar to real data, while regulating the number of extreme events in the dataset. This refiner-GAN method enables the deep-RL approach to generate large amounts of data that can be used practically in real-time operation. Simulation results show that the proposed deep-RL with refiner GAN can omit the transient training time and train the deep learning model on a controlled set of unlabeled real traffic in a relatively short time. Furthermore, the refiner GAN achieves 99.9999% reliability and an E2E latency of less than 1.4 ms.
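To make the actor-critic policy-gradient idea behind PGACL concrete, the following is a minimal illustrative sketch on a toy MDP. All state, action, and reward definitions here are hypothetical stand-ins (small discrete "channel-quality" states and "RB/power-level" actions), not the paper's actual system model: the critic learns a state-value table via TD(0), and the actor performs a softmax policy-gradient step using the TD error as the advantage estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting (illustrative only, not the paper's system model):
# states ~ channel-quality levels, actions ~ candidate RB/power levels.
N_STATES, N_ACTIONS = 4, 3
GAMMA, ALPHA_ACTOR, ALPHA_CRITIC = 0.95, 0.05, 0.1

theta = np.zeros((N_STATES, N_ACTIONS))  # actor: policy logits per state
V = np.zeros(N_STATES)                   # critic: state-value estimates

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step(s, a):
    # Hypothetical reward: a higher action index (more power) yields a
    # higher rate reward but incurs a power penalty; next-state dynamics
    # are random purely for brevity.
    reward = (s + 1) * (a + 1) * 0.1 - 0.05 * a**2
    return rng.integers(N_STATES), reward

for episode in range(500):
    s = rng.integers(N_STATES)
    for t in range(20):
        pi = softmax(theta[s])
        a = rng.choice(N_ACTIONS, p=pi)
        s_next, r = step(s, a)

        # Critic update: the TD(0) error doubles as the advantage estimate.
        td_error = r + GAMMA * V[s_next] - V[s]
        V[s] += ALPHA_CRITIC * td_error

        # Actor update: grad of log softmax policy is one_hot(a) - pi.
        grad_log_pi = -pi
        grad_log_pi[a] += 1.0
        theta[s] += ALPHA_ACTOR * td_error * grad_log_pi

        s = s_next
```

Restricting `N_ACTIONS` to a small per-user candidate set mirrors the reduced action space that PGACL relies on for its low computational cost; in a real URLLC scheduler the state and reward would instead encode queue lengths, channel gains, and the rate/latency constraints.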