Q-learning has been the predominant reinforcement learning (RL) technique for finding optimal routing paths in wireless sensor networks (WSNs). However, in centralized RL-based routing protocols with large state and action spaces, the baseline Q-learning used to implement these protocols suffers from slow convergence, reduced network lifetime, and increased energy consumption, owing to the large number of learning episodes required to learn the optimal routing path. To overcome these limitations, an efficient model-free RL technique, Least-Squares Policy Iteration (LSPI), is proposed to optimize network lifetime and energy consumption in WSNs. The resulting protocol is the Centralized Routing Protocol for Lifetime and Energy Optimization with a Genetic Algorithm (GA) and LSPI (CRPLEOGALSPI). Simulation results show that CRPLEOGALSPI outperforms the existing Centralized Routing Protocol for Lifetime Optimization with GA and Q-learning (CRPLOGARL) in both network lifetime and energy consumption. This is because CRPLEOGALSPI selects a routing path in a given state after considering all possible routing paths, and it is not sensitive to the learning rate. Moreover, whereas CRPLOGARL evaluates the optimal policy directly from tabulated Q-values, CRPLEOGALSPI approximates the Q-values with weighted functions that incorporate the most recent information about the network dynamics.
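For context, a minimal sketch of the standard LSPI formulation (Lagoudakis and Parr's LSTDQ step) may clarify the "weighted functions" mentioned above; the basis functions $\phi$ and discount factor $\gamma$ here are generic placeholders, not the paper's specific design. The Q-function is approximated as a weighted linear combination of $k$ basis functions,
$$\hat{Q}(s,a;w) = \sum_{j=1}^{k} \phi_j(s,a)\, w_j = \phi(s,a)^{\top} w,$$
and the weight vector is obtained in closed form from the collected transition samples $(s,a,r,s')$ under the current policy $\pi$:
$$w = A^{-1} b, \qquad A = \sum_{(s,a,r,s')} \phi(s,a)\bigl(\phi(s,a) - \gamma\,\phi(s',\pi(s'))\bigr)^{\top}, \qquad b = \sum_{(s,a,r,s')} \phi(s,a)\, r.$$
Because $w$ comes from a single least-squares solve over all samples rather than from incremental stochastic updates, no step-size parameter appears, which is consistent with the abstract's claim that the resulting protocol is insensitive to the learning rate.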