This work aims to increase students’ interest in music lessons and to promote modern teaching methods. A distributed application system for an artificial intelligence gesture-interactive robot is designed using deep learning and applied to music perception education. First, the user’s gesture instruction data are collected by a double-channel convolutional neural network (DCCNN), which uses convolution kernels of two sizes to extract feature information from the image and capture the gesture instructions in video frames. Second, a two-stream convolutional neural network (two-stream CNN) recognizes the collected gesture instruction data: spatial and temporal information is extracted from RGB images and optical flow images, respectively, fed into the two streams, and the prediction results of the two networks are fused to give the final detection result. Then, the distributed system used by the interactive robot is introduced; this structure improves the stability of the interactive system and reduces the demands on local hardware performance. Finally, experiments test the gesture command acquisition and recognition networks and the performance of the gesture-interactive robot in practice. The results indicate that combining convolution kernels of [Formula: see text] and [Formula: see text] raises the recognition accuracy of the DCCNN to 98% and allows gesture instruction data to be collected effectively. After training, the two-stream CNN reaches a gesture recognition accuracy of 90%, higher than that of mainstream dynamic gesture recognition algorithms trained on the same data set. In a recognition test of gesture instructions on the gesture-interactive robot itself, the recognition accuracy exceeds 90%, which meets routine interaction needs. The gesture-interactive robot therefore shows good reliability and stability and is applicable to music perception teaching.
The research reported here offers guidance for establishing music teaching that draws on multiple perception modes.
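
The fusion step of the two-stream CNN described above can be illustrated with a minimal sketch. The abstract does not state which fusion rule is used, so the weighted averaging of softmax scores below is an assumption, as are the function names `softmax` and `fuse_two_stream` and the parameter `spatial_weight`; the logits are made-up illustrative values, not results from this work.

```python
import math


def softmax(scores):
    """Convert raw class scores (logits) into probabilities."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_two_stream(spatial_logits, temporal_logits, spatial_weight=0.5):
    """Late fusion of two streams (assumed rule: weighted score averaging).

    spatial_logits:  class scores from the RGB (spatial) stream
    temporal_logits: class scores from the optical-flow (temporal) stream
    spatial_weight:  hypothetical weight in [0, 1] given to the spatial stream
    Returns the index of the predicted gesture class and the fused scores.
    """
    if len(spatial_logits) != len(temporal_logits):
        raise ValueError("both streams must score the same set of classes")
    if not 0.0 <= spatial_weight <= 1.0:
        raise ValueError("spatial_weight must lie in [0, 1]")

    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    fused = [
        spatial_weight * ps + (1.0 - spatial_weight) * pt
        for ps, pt in zip(p_spatial, p_temporal)
    ]
    label = max(range(len(fused)), key=fused.__getitem__)
    return label, fused


# Illustrative values only: the spatial stream favours class 0,
# while the temporal stream favours class 2 more strongly.
label, fused = fuse_two_stream([2.0, 0.5, 0.1], [0.2, 0.3, 3.0])
print(label, [round(p, 3) for p in fused])
```

With equal weights, the more confident stream dominates the fused prediction; setting `spatial_weight` to 1.0 or 0.0 reduces the sketch to a single-stream classifier, which is a convenient way to compare each stream against the fused result.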