Emerging computer vision and deep learning technologies are increasingly being applied to the intelligent analysis of sports training videos. In this paper, a deep-learning-based fine-grained action recognition (FGAR) method is proposed for analyzing soccer training videos. The proposed method was applied to indoor training equipment to evaluate whether a player has successfully stopped a soccer ball. First, the FGAR problem is modeled as a human-object (player-ball) interaction. Object-level trajectories are proposed as a new descriptor for recognizing fine-grained actions in sports videos. This descriptor exploits both high-level semantics and the motion of the human-object interaction. Second, a cascaded scheme of deep networks based on the object-level trajectories is proposed to realize FGAR. The cascaded network is constructed by concatenating a detector network with a classifier network based on long short-term memory (LSTM). The cascaded scheme takes advantage of the detector's high efficiency in object detection and the LSTM-based network's strong performance on time-series data. Experimental results show that the proposed method achieves an accuracy of 93.24%.
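As a rough illustration of the cascaded idea, the sketch below shows how per-frame player and ball detections might be turned into an object-level trajectory sequence suitable as input to a sequence classifier such as an LSTM. The function names (`box_center`, `object_level_trajectory`), the `(x1, y1, x2, y2)` box format, and the seven-value per-frame feature layout are assumptions made for this example; the abstract does not specify the exact form of the descriptor, and the detector and classifier themselves are not reproduced here.

```python
from math import hypot


def box_center(box):
    """Return the (x, y) center of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def object_level_trajectory(player_boxes, ball_boxes):
    """Build one feature vector per frame from per-frame detections.

    Hypothetical feature layout (an assumption, not the paper's descriptor):
    [player_x, player_y, ball_x, ball_y, dx, dy, distance],
    where (dx, dy) is the ball position relative to the player.
    """
    if len(player_boxes) != len(ball_boxes):
        raise ValueError("expected one player box and one ball box per frame")

    sequence = []
    for player_box, ball_box in zip(player_boxes, ball_boxes):
        px, py = box_center(player_box)
        bx, by = box_center(ball_box)
        dx, dy = bx - px, by - py
        sequence.append([px, py, bx, by, dx, dy, hypot(dx, dy)])
    return sequence


# Toy detections for two frames; in a real pipeline these would come
# from the detector network, one box per object per frame.
player_boxes = [(0, 0, 2, 4), (0, 0, 2, 4)]
ball_boxes = [(4, 4, 6, 6), (2, 3, 4, 5)]

trajectory = object_level_trajectory(player_boxes, ball_boxes)
print(trajectory[0])  # [1.0, 2.0, 5.0, 5.0, 4.0, 3.0, 5.0]
```

The resulting list of per-frame vectors is a time series describing the player-ball interaction; a classifier operating on such a sequence would then decide whether the ball was stopped successfully.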