This paper addresses the problem of event detection and localization in long football (soccer) videos. Our key idea is that understanding the long-range dependencies between video frames is imperative for accurate event localization in long football videos. Moreover, fast movements in football videos cannot be detected reliably without also considering short- and mid-range correlations between neighboring frames. We argue that event spotting can be considerably improved by modeling short-range to long-range frame dependencies in a unified architecture. To capture mid-range and long-range dependencies, we propose to use a dilated recurrent neural network (DilatedRNN) with long short-term memory (LSTM) units, built on two-stream convolutional neural network (two-stream CNN) features. While the two-stream CNN extracts the local spatiotemporal features needed for fine-level details, the DilatedRNN makes information from distant frames available to the classifier and spotting algorithms. Evaluating our event spotting algorithm on the largest publicly available benchmark football dataset, SoccerNet, shows an accuracy improvement of 0.8% to 13.6% over the state of the art, and up to a 30.1% accuracy gain over the baselines. We also investigate the contribution of each neural network component to spotting accuracy through an extensive ablation study.
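The core architectural idea, stacking recurrent layers whose recurrence skips over increasingly many frames, can be sketched as follows. This is a minimal illustration only: the paper uses LSTM cells over two-stream CNN features, whereas here a plain tanh recurrence and random "frame features" stand in, and all layer sizes and dilation factors are assumptions chosen for the example.

```python
import numpy as np

def rnn_cell(x, h, Wx, Wh, b):
    """One tanh recurrence step (a stand-in for the paper's LSTM unit)."""
    return np.tanh(x @ Wx + h @ Wh + b)

def dilated_rnn_layer(X, dilation, hidden, rng):
    """Recurrent layer in which step t reads the hidden state from step
    t - dilation, so larger dilations relay information across more
    distant frames in a single recurrent hop."""
    T, D = X.shape
    Wx = rng.standard_normal((D, hidden)) * 0.1
    Wh = rng.standard_normal((hidden, hidden)) * 0.1
    b = np.zeros(hidden)
    H = np.zeros((T, hidden))
    for t in range(T):
        h_prev = H[t - dilation] if t >= dilation else np.zeros(hidden)
        H[t] = rnn_cell(X[t], h_prev, Wx, Wh, b)
    return H

rng = np.random.default_rng(0)
frames = rng.standard_normal((64, 32))  # toy stand-in: 64 frames of 32-dim CNN features
H = frames
for dilation in [1, 2, 4, 8]:  # short-range to long-range dependencies
    H = dilated_rnn_layer(H, dilation, hidden=16, rng=rng)
print(H.shape)  # one 16-dim state per frame, fed to the classifier/spotter
```

With exponentially growing dilations, the top layer's receptive field spans many frames while each layer still performs only one recurrent step per frame, which is what lets distant context reach the event classifier.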