This paper analyzes the learning behavior of firms in a repeated Cournot oligopoly game. The literature shows that the degree of information and the cognitive capacity of learning firms are key factors in determining the long-run outcome of an oligopoly market. In particular, when firms know the market demand and can compute the optimal production quantity given the output of other firms, the resulting market outcome is the Nash equilibrium. By contrast, imitation, which assumes low behavioral sophistication of firms, generally favors higher output and converges to the Walrasian equilibrium. In this paper, a reinforcement learning algorithm with low cognitive requirements is adopted to model firms’ learning behavior. Reinforcement learning firms observe their past production choices and fine-tune them to improve profits. Analytical results show that the Nash equilibrium is the only fixed point of the reinforcement learning process, and convergence to the Nash equilibrium is observed in computational simulations. When firms are also allowed to imitate the most profitable competitor, all states between the Nash equilibrium and the Walrasian equilibrium can be reached. Furthermore, the long-run outcome shifts towards the Nash equilibrium as the length of firms’ memory increases.
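To make the two benchmark outcomes concrete, the following minimal sketch computes the per-firm Nash and Walrasian quantities in a symmetric Cournot market. It assumes a linear inverse demand P(Q) = a - bQ and a constant marginal cost c, neither of which is specified in the text above; the function names and parameter values are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: linear inverse demand P(Q) = a - b*Q and a constant
# marginal cost c are assumptions made for this example, not the paper's model.

def nash_quantity(a, b, c, n):
    """Per-firm output in the symmetric Cournot-Nash equilibrium."""
    return (a - c) / (b * (n + 1))

def walrasian_quantity(a, b, c, n):
    """Per-firm output when price equals marginal cost (competitive outcome)."""
    return (a - c) / (b * n)

def best_response(a, b, c, others_total):
    """Profit-maximizing output of one firm given the rivals' total output."""
    return max(0.0, (a - c - b * others_total) / (2 * b))

def price(a, b, total):
    """Market price at total output `total`, floored at zero."""
    return max(0.0, a - b * total)

# Hypothetical parameter values.
a, b, c, n = 100.0, 1.0, 10.0, 3

q_nash = nash_quantity(a, b, c, n)
q_walras = walrasian_quantity(a, b, c, n)

# At the Nash quantity, each firm's best response to its rivals is that same quantity.
br_at_nash = best_response(a, b, c, (n - 1) * q_nash)

# At the Walrasian quantity, the market price equals marginal cost.
p_walras = price(a, b, n * q_walras)

print(q_nash, q_walras, br_at_nash, p_walras)
```

Under these assumptions the Walrasian quantity exceeds the Nash quantity for any finite number of firms, which is the range of outcomes the abstract refers to when imitation is combined with reinforcement learning.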