Abstract The microgrid model is formulated as a Markov decision model and solved with deep Q-learning to realize coordinated management and energy optimization of the microgrid. To meet the requirements of the Markov decision model, the working state of the gas turbine is divided into discrete gears. To reduce the fitting difficulty and improve the accuracy and convergence stability of the results, a linear penalty term is added to the reward function. Taking into account operating and environmental costs, the purchase and sale prices of the main power grid, peak and valley electricity prices, and operating constraints, the method further reduces both the peak-valley difference and the operating cost. Finally, the output of the battery and the output of the distributed power supply are obtained through simulation. Through its peak-shaving and valley-filling function, the battery copes effectively with fluctuations in peak and valley electricity prices, reduces the peak-valley difference of the electric heating load, and provides strong support for sustainable energy management.
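The two modelling choices named above, discrete gas turbine gears and a reward with a linear penalty term, can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the rated power, the number of gears, the penalty coefficient, the choice of constraint violation as the penalized quantity, and the function names `gear_to_power` and `reward` do not come from the paper.

```python
# Hypothetical values for illustration only; none are taken from the paper.
GT_MAX_KW = 60.0        # assumed gas turbine rated power (kW)
N_GEARS = 5             # assumed number of discrete gears (actions)
PENALTY_PER_KW = 2.0    # assumed coefficient of the linear penalty


def gear_to_power(gear: int) -> float:
    """Map a discrete gear index (0..N_GEARS-1) to gas turbine output in kW."""
    if not 0 <= gear < N_GEARS:
        raise ValueError("gear out of range")
    # Gears are spaced evenly between zero output and rated power.
    return GT_MAX_KW * gear / (N_GEARS - 1)


def reward(cost: float, violation_kw: float) -> float:
    """Negative cost minus a penalty that grows linearly with the violation.

    `violation_kw` is an assumed measure of how far an operating
    constraint is exceeded; non-positive values incur no penalty.
    """
    return -cost - PENALTY_PER_KW * max(violation_kw, 0.0)
```

With a discrete set of gears the agent chooses among a finite number of actions, which is what a Q-network with one output per action needs. A linear penalty keeps the reward's slope constant as the violation grows, which may be easier for a network to fit than a steeper form.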