The article describes the formulation of the problem of recognition of the movements of objects in a video sequence, the stages of its solution, the analysis of the basic methods of each of the stages. A wide range of applications and growing requirements on the quality of recognition determines the relevance of the study. The process of action recognition and detection begins with extracting useful features, from the input video sequence. Features are then processed through a classier to identify the action class (for example, running, walking, jumping, various gestures). The article describes the main feature descriptors, in the filter-based category: histogram of oriented gradients, cuboid descriptor, scale-invariant feature transform, gradient location-orientation histogram, local trinary patterns, and spatiotemporal patches, optical flow-based descriptors: histograms of optical flow, the motion boundary histogram, dense trajectory, convolutional neural network-based descriptors. Some algorithms require the extraction of primitive features and further refinement of the auxiliary features before they can be passed to the classifier. Examples of the use of specialized primitive features are methods based on silhouettes / contours and methods based on object tracking. There are methods for classifying extracted features, including the following: support vector machines, adaptive boost, artificial neural networks, convolutional neural networks. The key difficulties arising in solving the problem are considered. There are ways to compare various methods. One of the ways to draw comparisons is to quantitatively evaluate each approach on the same database with the same protocol. From simple KTH datasets and Weizmannnd to Carnegie Mellon University Crowded Videos dataset and Microsoft Research Action Group dataset to more complex video conditions and large-scale UCF101 and ActivityNet datasets. Existing approaches to recognition of motion in video sequences are analyzed. The article reveals characteristics, strengths and weaknesses of the various methods of detecting features and their classification. Leading methods that show the best results widely use convolutional neural networks. One of such methods is a spatio-temporal graph convolutional neural network for action recognition based on the object's skeleton. A method for further research and improvement was chosen.