Prior convolution-based road crack detectors typically learn more abstract visual representation with increasing receptive field via an encoder-decoder architecture. Despite the promising accuracy, progressive spatial resolution reduction causes semantic feature blurring, leading to coarse and incontiguous distress detection. To these ends, an alternative sequence-to-sequence perspective with atransformer network termed TransCrack is introduced for road crack detection. Specifically, an image is decomposed into a grid of fixed-size crack patches, which is flattened with position embedding into a sequence. We further propose a pure transformer-based encoder with multi-head reduced self-attention modules and feed-forward networks for explicitly modelling long-range dependencies from the sequential input in a global receptive field. More importantly, a simple decoder with cross-layer aggregation architecture is developed to incorporate global with local attentions across different regions for detailed feature recovery and pixel-wise crack mask prediction. Empirical studies are conducted on three publicly available damage detection benchmarks. The proposed TransCrack achieves a state-of-the-art performance over all counterparts by a substantialmargin, and qualitative results further demonstrate its superiority in contiguous crack recognition and fine-grained profile extraction. This article is part of the theme issue 'Artificial intelligence in failure analysis of transportation infrastructure and materials'.