Training a dynamically configurable classifier with deep Q-learning

R O Malashin, A A Boiko

doi:10.1364/jot.89.000437

Abstract

Subject of study. We studied dynamic networks capable of performing calculations from input data. Aim. We studied whether deep Q-learning can be used to construct dynamic computer vision networks. Methods. In modern dynamically configurable systems, image analysis is typically performed using a policy gradient algorithm. We propose a hybrid Q-learning method for an image classification agent that takes into account limits on available computing resources. We train the agent to recognize images using a set of pretrained classifiers. The resulting dynamically configurable system can construct a computational graph that respects the limit on the number of operations and follows the trajectory with the maximum expected accuracy. The agent receives a reward only when the image is correctly recognized within the limit on the number of actions the agent may take. Experiments were performed using the CIFAR-10 image database and a set of six external classifiers that the agent was trained to control. The experiments showed that the standard deep learning method using action values (Deep Q-Network) does not permit the agent to learn strategies that are better than random ones in terms of recognition accuracy. We therefore propose a Q-least-action classifier that approximates the desired classifier selection function by reinforcement learning and the label prediction function by supervised learning. Main results. The trained agent exceeded the recognition accuracy of random strategies, reducing the error by 9.65%. We show that such an agent can make explicit use of information from several classifiers, since the accuracy increases when the number of permitted actions increases. Practical significance. Our research shows that the deep Q-learning method can extract information from sparse classifier responses as effectively as a least-action classifier trained by the policy-gradient method. In addition, the proposed method does not require the development of special loss functions.
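The episode structure described in the abstract (query a limited number of pretrained classifiers, then predict a label, with a reward only for a correct prediction) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the classifiers are stubs that return the true label with a fixed probability, the selection policy is the random baseline rather than a learned Q-function, and majority voting stands in for the supervised label-prediction function. The names `make_stub_classifier`, `run_episode`, `random_selection`, and `majority_vote` are hypothetical.

```python
import random

NUM_CLASSES = 10       # CIFAR-10 has ten classes
NUM_CLASSIFIERS = 6    # the abstract mentions six external classifiers


def make_stub_classifier(accuracy, rng):
    """Hypothetical stand-in for a pretrained classifier.

    Returns the true label with probability `accuracy`, otherwise a wrong label.
    """
    def classify(true_label):
        if rng.random() < accuracy:
            return true_label
        wrong = [c for c in range(NUM_CLASSES) if c != true_label]
        return rng.choice(wrong)
    return classify


def run_episode(true_label, classifiers, select_action, predict_label, max_actions):
    """Query at most `max_actions` classifiers, then predict a label.

    The reward is sparse: 1.0 only if the final prediction is correct.
    """
    responses = [None] * len(classifiers)  # None marks a classifier not yet queried
    for _ in range(min(max_actions, len(classifiers))):
        action = select_action(responses)
        responses[action] = classifiers[action](true_label)
    prediction = predict_label(responses)
    reward = 1.0 if prediction == true_label else 0.0
    return reward, responses


def random_selection(rng):
    """Random baseline policy: pick any classifier that has not been queried yet."""
    def select(responses):
        unused = [i for i, r in enumerate(responses) if r is None]
        return rng.choice(unused)
    return select


def majority_vote(responses):
    """Assumed stand-in for the label-prediction function: most frequent response."""
    votes = [r for r in responses if r is not None]
    if not votes:
        return 0  # arbitrary fallback when no classifier was queried
    return max(set(votes), key=votes.count)


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed accuracies, chosen only for the demonstration.
    classifiers = [make_stub_classifier(a, rng) for a in (0.5, 0.55, 0.6, 0.65, 0.7, 0.75)]
    policy = random_selection(rng)
    for budget in (1, 3, 5):
        rewards = [
            run_episode(rng.randrange(NUM_CLASSES), classifiers, policy, majority_vote, budget)[0]
            for _ in range(2000)
        ]
        print(f"budget={budget}: mean reward {sum(rewards) / len(rewards):.3f}")
```

In this sketch, a trained agent would replace `random_selection` with a policy that picks the action with the highest estimated value for the current sparse response vector, and `majority_vote` with a learned predictor.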
