Recognizing smoke in surveillance videos is crucial for robust fire detection because video data contain rich temporal cues. However, the slow motion of smoke makes it difficult to learn complete motion information. In addition, the max/average pooling operations widely used in smoke recognition models may discard spatial detail. To address these issues, we propose an adaptive frame selection network (AFSNet) with enhanced dilated convolution for video smoke recognition. First, to reduce information redundancy and learn discriminative feature representations, we propose an adaptive frame selection convolution that automatically selects the most useful frames in the image sequence. Second, we propose an enhanced dilated convolution that considers both high-response pixels and the local region, automatically learning each pixel's contribution to the output; by exploiting all pixels, it reduces the loss of detail information and preserves global information. Finally, we design a feature extraction module without any average or max pooling operations to learn multi-scale, context, and spatiotemporal information simultaneously. Experiments show that our model achieves the highest detection rate of 0.9673 and the lowest false alarm rate of 0.0316 on the SRSet dataset, and the highest F1-scores of 0.85, 0.86, and 0.91 on the three subsets of the RISE dataset, respectively.
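The enhanced dilated convolution described above can be sketched as follows. This is a minimal, hypothetical PyTorch illustration, not the paper's exact layer: it assumes a standard dilated convolution over the local region, combined with a learned per-pixel gate that up-weights high-response pixels so that every pixel contributes to the output without any pooling.

```python
# Hypothetical sketch of an "enhanced" dilated convolution: a dilated
# convolution over the local region, modulated by a learned per-pixel
# gate. The class name and gating design are illustrative assumptions.
import torch
import torch.nn as nn


class EnhancedDilatedConv(nn.Module):
    def __init__(self, in_ch, out_ch, dilation=2):
        super().__init__()
        # Dilated 3x3 convolution: enlarges the receptive field while
        # keeping the spatial resolution, so no detail is discarded.
        self.dilated = nn.Conv2d(in_ch, out_ch, kernel_size=3,
                                 padding=dilation, dilation=dilation)
        # 1x1 convolution producing a per-pixel weight in (0, 1);
        # high-response pixels receive larger weights in the output.
        self.gate = nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1),
                                  nn.Sigmoid())

    def forward(self, x):
        # Each output pixel is the dilated response scaled by its
        # learned contribution weight.
        return self.dilated(x) * self.gate(x)


feat = torch.randn(1, 16, 32, 32)        # (batch, channels, H, W)
out = EnhancedDilatedConv(16, 32)(feat)  # spatial size is preserved
```

Because `padding` equals the dilation rate, the layer preserves the input's spatial size, which is consistent with the paper's goal of avoiding the detail loss caused by pooling.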