While vision transformers have been highly successful in improving performance on image-based tasks, little work has been reported on applying transformers to scene text recognition, owing to the complex visual appearance of multi-scale texts. To fill this gap, this paper proposes an adaptive n-gram transformer for multi-scale scene text recognition (ANT-STR). In ANT-STR, an adaptive n-gram embedding automatically determines the optimal size of each image patch in order to fully exploit the potential semantic correlations between neighboring visual patches, which is essential for feature extraction from multi-scale scene texts. On top of the adaptive n-gram embedding, a patch-based n-gram attention mechanism is introduced to further process the feature maps for multi-scale texts. In addition, the loss function is revised to take into account both multi-scale character-based identification and contextual coherence scoring. Comparative studies are conducted on five widely used benchmark datasets and on a new multi-scale scene text dataset collected from tourism scenes in Indonesia. The experimental results demonstrate that ANT-STR considerably outperforms state-of-the-art methods, especially on complex multi-scale scene texts.