Nowadays recommendation is an important part of everyday life; it is used for industrial, business, and even academic purposes. Various recommendation techniques are currently available, such as content-based filtering [11], matrix factorization [6], the Naive Bayes classifier [1, 12], logistic regression [1], and multi-armed bandit approaches [7, 18]. Unfortunately, all of them treat recommendation as a static process: they assume that user preferences do not change over time and ignore the fact that user interactions with the system are sequential. These methods and algorithms are widely used by giant companies such as Amazon [9], Netflix [4, 15], and Google [2].

This article demonstrates the use of machine learning, and in particular reinforcement learning, to recommend items based on users' positive interactions with the system. Unlike the approaches above, this one is dynamic: it adapts to changes in user preferences and focuses on maximizing the benefit of user interaction with the recommendation system. Moreover, reinforcement learning optimizes for long-term reward and treats user interactions with the system as a sequential process. The article describes an example of applying this approach to recommend movies based on the positive ratings users give them; in general, however, the technique can be applied to any type of item, the only requirement being a numeric representation of the items that the neural network interface can work with.

The environment of the proposed recommendation system is modeled as a Markov Decision Process [19] in which the user interacts with a recommendation agent and the agent proposes items for that specific user (a concrete formulation is given below). The neural network architecture is built on the "Actor-Critic" learning model [19], which combines two learning methods: policy learning in the "Actor" network and Q-learning in the "Critic" network. The Actor network takes the user's positive interactions with the system as input and generates an approximate recommendation. The Critic network, in turn, takes that approximate recommendation, concatenates it with the positive-interaction state, and computes a Q-value that estimates how good the approximate recommendation will be (both networks are sketched below). The combination of these two learning techniques accounts both for dynamic adaptation to changing user preferences and for long-term interactions, which leads to fast training and better recommendations. The main purpose of this process is to maximize reward, which means that the proposed recommendation will be positive for both the user and the system.

Inside the system, movies are represented as arrays of numeric elements so that they can work with the neural network interface. Each movie's representation is built from its specific characteristics, such as year of production, director, genres, country of production, production companies, rating, popularity, budget, etc. Principal component analysis is used to combine all of those properties into a single vector (see the sketch below). In addition, a "State representation module" was developed to represent user ratings: it combines the rated movies with the numeric ratings themselves and encodes them in the same way, as an array of numbers (also sketched below). Ten thousand movies and one million ratings from the MovieLens dataset were used, of which 95% of the ratings were used to train the model and the remaining 5% to evaluate its accuracy.
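For concreteness, a plausible formulation of the underlying MDP, following the standard (S, A, P, R, γ) notation of [19], is: a state s ∈ S encodes the user's history of positive interactions, an action a ∈ A is the item vector proposed by the agent, P(s'|s, a) describes how the state changes after a recommendation is made, R(s, a) is the immediate reward (here naturally derived from the rating the user gives the recommended movie), and the discount factor γ ∈ [0, 1) is what lets the agent weigh immediate ratings against long-term user satisfaction.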
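To make the two networks concrete, below is a minimal sketch of the Actor and the Critic. PyTorch, the layer sizes, and the class names are assumptions for illustration; the article specifies neither a framework nor exact dimensions.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Maps the user's state (built from positive interactions) to an
    approximate recommendation in item-embedding space."""
    def __init__(self, state_dim: int, item_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, item_dim),
            nn.Tanh(),  # keep the action inside the item-embedding range
        )

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.net(state)

class Critic(nn.Module):
    """Concatenates the state with the Actor's approximate recommendation
    and outputs a scalar Q-value estimating how good it will be."""
    def __init__(self, state_dim: int, item_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + item_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),  # scalar Q-value
        )

    def forward(self, state: torch.Tensor, action: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([state, action], dim=-1))
```

The Actor's output is a continuous vector in item-embedding space; a common design choice for such approximate actions (not detailed in the article) is to match the vector to the nearest real item embedding to obtain the final recommendation.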
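The principal component analysis step can be sketched as follows with scikit-learn. The feature matrix, its width, and the 32-dimensional embedding size are assumptions for illustration; in practice the categorical characteristics (director, genres, country, companies) would first be one-hot encoded.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# One row per movie: numeric columns (year, rating, popularity, budget)
# plus one-hot encodings of director, genres, country, and companies.
raw_features = np.random.rand(10_000, 300)    # placeholder for the real matrix

scaled = StandardScaler().fit_transform(raw_features)
pca = PCA(n_components=32)                    # embedding size is an assumption
movie_embeddings = pca.fit_transform(scaled)  # shape: (10_000, 32)
```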
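The article does not give the internals of the "State representation module"; one minimal sketch consistent with its description (rated movies combined with the numeric ratings themselves into a single array of numbers) weights each movie embedding by its rating and flattens the result. The function name, the choice of rating-weighting, and the history length are assumptions, and padding for users with fewer than n_last ratings is omitted.

```python
import numpy as np

def build_state(movie_ids, ratings, movie_embeddings, n_last: int = 10):
    """Combine the user's last n positively rated movies with the numeric
    ratings themselves into one flat state vector."""
    ids = np.asarray(movie_ids[-n_last:])
    r = np.asarray(ratings[-n_last:], dtype=np.float32)
    weighted = movie_embeddings[ids] * r[:, None]  # rating-weighted embeddings
    return weighted.reshape(-1)                    # flat array of numbers
```

The resulting vector is what the Actor network sketched above would receive as its input state.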
The proposed approach demonstrates strong convergence, up to 98% on large datasets, and does not require much time to train the neural network model. The proposed methodology combines a dynamic approach to generating recommendations with an understanding of user actions as a sequential process and a focus on obtaining a long-term reward. Recommendation done this way is very promising and can be used by large companies to maximize their profits from selling different types of goods or information.