In the framework of current off-policy actor–critic methods, the state–action pairs stored in the experience replay buffer (called historical behaviors) cannot be used to improve the policy, and policy evaluation relies on target networks and clipped double Q-learning. This framework limits the policy learning capability in complex environments and requires maintaining four critic networks. To address these issues, we propose an efficient and lightweight off-policy actor–critic (EL-AC) framework. For policy improvement, we propose an efficient off-policy likelihood-ratio policy gradient algorithm with historical behavior reuse (PG-HBR), which enables the agent to learn an approximately optimal policy from historical behaviors. Moreover, we design a theoretically interpretable universal critic network that approximates the action-value and state-value functions simultaneously, yielding the advantage function required by PG-HBR. For policy evaluation, we develop low-pass filtering of target state-values and adaptive control of overestimation bias, which evaluate the policy efficiently and accurately using only one universal critic network. Extensive evaluation results indicate that EL-AC outperforms state-of-the-art off-policy actor–critic methods in terms of approximately optimal policy learning and neural-network storage footprint, and that it is better suited to policy learning in complex environments.
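
As a rough illustration only (not the authors' implementation), the sketch below shows how the components named above might fit together in PyTorch: a single universal critic with two heads producing Q(s, a) and V(s), the advantage A = Q − V fed into an off-policy likelihood-ratio policy gradient over replayed behaviors, and a first-order low-pass filter on target state-values in place of a separate target-network copy. The class and function names, network sizes, and the filter coefficient `beta` are all hypothetical choices made for the example.

```python
# Minimal sketch, assuming a PyTorch-style setup; all names and
# hyperparameters here are illustrative, not taken from the paper.
import torch
import torch.nn as nn

class UniversalCritic(nn.Module):
    """One network body with two heads: Q(s, a) and V(s)."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.q_head = nn.Sequential(nn.Linear(hidden + action_dim, hidden),
                                    nn.ReLU(), nn.Linear(hidden, 1))
        self.v_head = nn.Sequential(nn.Linear(hidden, hidden),
                                    nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, s, a):
        h = self.body(s)
        q = self.q_head(torch.cat([h, a], dim=-1))  # action-value estimate
        v = self.v_head(h)                          # state-value estimate
        return q, v

def lowpass_target_v(prev_target_v, new_v, beta=0.9):
    """Low-pass (exponential) filtering of target state-values, keeping the
    target smooth without maintaining a full target-network copy."""
    return beta * prev_target_v + (1.0 - beta) * new_v

def likelihood_ratio_pg_loss(log_prob_new, log_prob_behavior, advantage):
    """Off-policy likelihood-ratio policy gradient over replayed behaviors:
    importance ratio times the advantage, averaged over the batch."""
    ratio = torch.exp(log_prob_new - log_prob_behavior)
    return -(ratio * advantage.detach()).mean()

# Toy usage with random tensors.
critic = UniversalCritic(state_dim=8, action_dim=2)
s, a = torch.randn(32, 8), torch.randn(32, 2)
q, v = critic(s, a)
adv = q - v  # advantage used by the policy-gradient term above
```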