Event-related potentials (ERPs) reflect the brain's neurophysiological responses to external events, and their complex underlying spatiotemporal features are governed by ongoing oscillatory brain activity. Deep learning methods have been increasingly adopted for ERP-based brain-computer interfaces (BCIs) because their strong feature representation abilities enable deep analysis of this oscillatory activity. Features at higher spatiotemporal frequencies typically capture detailed, localized information, whereas features at lower spatiotemporal frequencies capture global structure; mining EEG features at multiple spatiotemporal frequencies therefore yields more discriminative information. This article proposes a multiscale feature fusion octave convolutional neural network (MOCNN). MOCNN divides ERP signals into high-, medium-, and low-frequency components corresponding to different resolutions and processes them in separate branches. Incorporating the medium- and low-frequency components enriches the feature information available to MOCNN while reducing the computational cost. After successive feature mapping with temporal and spatial convolutions, MOCNN realizes interactive learning among the components by exchanging feature information across branches. Classification is performed by feeding the fused deep spatiotemporal features of all components into a fully connected layer. Results on two public datasets and a self-collected ERP dataset show that MOCNN achieves state-of-the-art ERP classification performance. This study introduces the generalized concept of octave convolution into ERP-BCI research, enabling effective spatiotemporal features to be extracted from multiscale networks through branch-width optimization and cross-scale information interaction.
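To make the three-branch octave pattern described above concrete, the following PyTorch sketch illustrates the general idea: the input is split into high-, medium-, and low-frequency branches at decreasing temporal resolutions, each branch applies a temporal convolution, the branches exchange resampled feature maps, and a spatial convolution plus fusion feeds a fully connected classifier. All module names (`TriOctaveBlock`, `MOCNNSketch`), kernel sizes, channel widths, and pooling factors are illustrative assumptions, not the published MOCNN configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TriOctaveBlock(nn.Module):
    """Per-branch temporal convolution followed by cross-branch exchange
    of feature information via resampling along the time axis."""

    def __init__(self, channels: int, kernel: int = 15):  # kernel size is an assumption
        super().__init__()
        pad = kernel // 2
        self.conv_h = nn.Conv2d(channels, channels, (1, kernel), padding=(0, pad))
        self.conv_m = nn.Conv2d(channels, channels, (1, kernel), padding=(0, pad))
        self.conv_l = nn.Conv2d(channels, channels, (1, kernel), padding=(0, pad))

    @staticmethod
    def _resize(x, target_len):
        # Match another branch's temporal resolution (up- or down-sampling).
        return F.interpolate(x, size=(x.shape[2], target_len), mode="nearest")

    def forward(self, h, m, l):
        h = F.elu(self.conv_h(h))
        m = F.elu(self.conv_m(m))
        l = F.elu(self.conv_l(l))
        t_h, t_m, t_l = h.shape[-1], m.shape[-1], l.shape[-1]
        # Each branch receives the other two branches resampled to its rate.
        h_out = h + self._resize(m, t_h) + self._resize(l, t_h)
        m_out = m + self._resize(h, t_m) + self._resize(l, t_m)
        l_out = l + self._resize(h, t_l) + self._resize(m, t_l)
        return h_out, m_out, l_out


class MOCNNSketch(nn.Module):
    """Toy three-branch classifier; input shape (batch, 1, electrodes, time)."""

    def __init__(self, n_electrodes: int = 32, n_classes: int = 2, feat: int = 8):
        super().__init__()
        self.stem = nn.Conv2d(1, feat, (1, 15), padding=(0, 7))
        self.octave = TriOctaveBlock(feat)
        # Spatial convolution collapses the electrode dimension (shared here
        # across branches for brevity; separate convolutions are also plausible).
        self.spat = nn.Conv2d(feat, feat, (n_electrodes, 1))
        self.fc = nn.Linear(3 * feat, n_classes)

    def forward(self, x):
        x = self.stem(x)
        h = x                              # high-frequency branch, full rate
        m = F.avg_pool2d(x, (1, 2))        # medium-frequency branch, half rate
        l = F.avg_pool2d(x, (1, 4))        # low-frequency branch, quarter rate
        h, m, l = self.octave(h, m, l)
        feats = []
        for b in (h, m, l):
            b = F.elu(self.spat(b))        # (batch, feat, 1, T_branch)
            feats.append(b.mean(dim=(2, 3)))  # global average over space/time
        z = torch.cat(feats, dim=1)        # fuse the three branches' features
        return self.fc(z)


# Example: a batch of 4 epochs, 32 electrodes, 256 time samples.
model = MOCNNSketch(n_electrodes=32, n_classes=2)
logits = model(torch.randn(4, 1, 32, 256))
print(logits.shape)  # torch.Size([4, 2])
```

The lower-rate branches operate on shorter sequences, which is where the reduction in computational cost comes from; the cross-branch resampling step is a simple stand-in for the information interaction among scales that the abstract describes.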