Recent works have validated the possibility of improving the energy efficiency of radio access networks (RANs) by dynamically turning some base stations (BSs) on or off. In this paper, we extend the analysis of BS switching operations, which ought to match the traffic load variations. We first formulate the traffic variations as a Markov decision process (MDP). Then, in order to minimize the energy consumption of RANs, we design a reinforcement-learning-based BS switching operation scheme. Furthermore, to speed up the ongoing learning process, a transfer actor-critic algorithm (TACT), which utilizes the learning experience transferred from historical periods or neighboring regions, is proposed and provably converges. The proposed TACT algorithm contributes to a performance jumpstart and demonstrates the feasibility of significant energy efficiency improvement at the expense of tolerable delay performance.

Index Terms: Radio access networks, base stations, green communications, energy saving, reinforcement learning, transfer learning, actor-critic algorithm.

I. Introduction

Wireless cellular networks have grown rapidly over the past few decades, and both the number of subscribers and the traffic volume in cellular networks have increased explosively. A base station (BS) transmits common control signals and data signals to mobile users (MUs). Network planning, cell size, and capacity are usually fixed based on an estimate of the peak traffic load. For a cellular network in a city, the traffic load in the daytime is relatively heavy in office areas and light in residential areas, while the opposite happens in the evening. The large number of BSs contributes a significant portion of the energy consumption of cellular networks: when a BS is in its working mode, the energy consumed by the processing circuits and the cooling system takes up about 60% of the total. The information and communication technology (ICT) industry accounts for roughly 2% to 10% of the world's overall power consumption, and currently more than 80% of the power consumption of cellular systems takes place in the radio access networks (RANs), especially at the BSs.

Luca Chiaraviglio et al. showed the possibility of energy saving by simulations and proposed a method to dynamically adjust the working status of BSs depending on the anticipated traffic loads. However, reliably predicting the traffic loads remains quite difficult, which makes such schemes hard to use in practical applications. On the other hand, other studies presented dynamic BS switching algorithms with the traffic loads known a priori and preliminarily proved the effectiveness of energy saving. Besides, it has also been found that turning some of the BSs on or off directly affects which BS a mobile terminal (MT) is associated with. Moreover, user association decisions in turn lead to traffic load variations at the BSs. Hence, any two consecutive BS switching operations are correlated with each other, and the current BS switching operation will further influence the overall energy consumption in the long run. In other words, the desired energy saving scheme must be farsighted while minimizing the energy consumption: it ought to consider the effect of a decision on both the current and future system performance in order to deliver a forward-looking BS switching solution.
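To make this long-term view concrete, the following is a minimal sketch of how BS switching can be cast as an MDP. The state, action, and cost definitions here (a discretized traffic load level per BS, an on/off vector, and an energy-plus-delay cost weighted by `ETA`) are illustrative assumptions for exposition, not the paper's exact formulation.

```python
import itertools

# Assumed, simplified MDP for BS on/off switching:
#   state  - tuple of discretized traffic load levels, one per BS
#   action - tuple of on/off flags, one per BS (at least one BS stays on)
#   cost   - energy of the chosen configuration plus a load-induced delay proxy

N_BS = 3                      # number of BSs in the region (assumption)
LOAD_LEVELS = range(4)        # discretized traffic load: 0 (idle) .. 3 (peak)
P_ACTIVE, P_SLEEP = 1.0, 0.1  # normalized power in working/sleep mode (assumed)
ETA = 0.5                     # energy-vs-delay trade-off weight (assumed)

STATES = list(itertools.product(LOAD_LEVELS, repeat=N_BS))
ACTIONS = [a for a in itertools.product((0, 1), repeat=N_BS) if any(a)]

def cost(state, action):
    """Immediate cost: energy of the on/off configuration plus a delay
    penalty that grows when the load is served by fewer active BSs."""
    energy = sum(P_ACTIVE if on else P_SLEEP for on in action)
    delay_penalty = sum(state) / sum(action)  # load concentrated on active BSs
    return energy + ETA * delay_penalty
```

A farsighted policy then minimizes the expected long-term discounted cost over this state space; crucially, reinforcement learning can do so without knowing the traffic transition probabilities in advance.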
The authors of a prior work presented a partly farsighted energy saving scheme that combines BS switching operation and user association, giving a heuristic solution on the basis of a stationary traffic load profile. In this paper, we try to solve this problem from a different perspective: rather than predicting the amount of traffic load, we apply a Markov decision process to model the traffic load variations. Afterwards, the solution to the formulated MDP problem is obtained by means of the actor-critic algorithm, a reinforcement learning (RL) approach, one advantage of which is that no a priori knowledge of the traffic loads is required. Moreover, transfer learning (TL) can exploit the temporal and spatial relevancy in the traffic loads to speed up the ongoing learning process in the regions of interest, as depicted in Fig. 1. As a result, the learning framework for BS switching operation is further enhanced by incorporating the idea of TL into the classical actor-critic algorithm, namely the transfer actor-critic algorithm (TACT).
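As a rough illustration of how transferred knowledge can provide such a jumpstart, the sketch below layers a TACT-style blending on top of a tabular actor-critic learner: the policy's action preferences mix a native term, learned from the target region's own temporal-difference errors, with an exogenous term copied from a source region's policy, and the source's influence decays as local experience accumulates. The constants and schedules here (GAMMA, ALPHA, BETA, the 1/(1+t) decay) are illustrative assumptions, not the paper's exact algorithm.

```python
import math
import random
from collections import defaultdict

GAMMA, ALPHA, BETA = 0.9, 0.1, 0.05  # discount / critic / actor step sizes (assumed)

class TransferActorCritic:
    """Tabular actor-critic whose policy blends native and transferred preferences."""

    def __init__(self, actions, source_pref=None):
        self.actions = actions
        self.value = defaultdict(float)    # critic: state -> V(s)
        self.native = defaultdict(float)   # natively learned (state, action) preferences
        self.source = source_pref if source_pref is not None else defaultdict(float)
        self.t = 0                         # interaction counter

    def preference(self, s, a):
        omega = 1.0 / (1.0 + self.t)       # transferred influence decays over time (assumed)
        return (1 - omega) * self.native[(s, a)] + omega * self.source[(s, a)]

    def act(self, s):
        # Boltzmann (softmax) policy over the blended preferences
        prefs = [self.preference(s, a) for a in self.actions]
        m = max(prefs)
        weights = [math.exp(p - m) for p in prefs]
        return random.choices(self.actions, weights=weights)[0]

    def update(self, s, a, c, s_next):
        delta = -c + GAMMA * self.value[s_next] - self.value[s]  # TD error on negated cost
        self.value[s] += ALPHA * delta     # critic update
        self.native[(s, a)] += BETA * delta  # actor update on native preferences
        self.t += 1
```

At t = 0 a freshly initialized learner behaves exactly like the source policy and then gradually overrides it with locally learned preferences, which is the mechanism behind the jumpstart effect described above.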