As urbanization continues, the service life of sewer pipes is gradually approaching a critical threshold. Defects within pipe networks can significantly affect municipal operations and residents' quality of life. Toward efficient and automatic sewer pipeline inspection, an integrated framework is proposed for semantic segmentation and severity quantification of multiple sewer pipe defects using PipeTransUNet, a model that fuses convolutional neural networks with a Transformer. The capability of the fusion network to extract and localize sewer defects was further enhanced by incorporating the convolutional block attention module and improving the activation function. In extensive experiments covering hyperparameter selection, architecture variations, and comparison with other state-of-the-art models, PipeTransUNet proved competitive. Specifically, PipeTransUNet outperformed the other models in both quantitative and visual evaluations, with mean intersection over union, the mean of pixel accuracy, mean pixel accuracy, mean recall, mean F1-score, mean specificity, mean Kappa, and frequency-weighted intersection over union values reaching 71.92%, 84.90%, 80.74%, 85.93%, 83.05%, 84.44%, 55.55%, and 91.32%, respectively. A severity level assessment method for different sewer defects was developed based on PipeTransUNet and compared with expert review results to demonstrate its feasibility and effectiveness. Moreover, the deep image features extracted from the segmentation head of the proposed model were visually evaluated and interpreted using pixel-level gradient-weighted class activation mapping. In summary, PipeTransUNet can recognize complex defects well and provides a solid foundation for inspecting and maintaining sewer pipe networks.
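To make the reported segmentation metrics concrete, the following is a minimal sketch of how several of them (pixel accuracy, mean pixel accuracy, mean intersection over union, frequency-weighted intersection over union, and Cohen's Kappa) are commonly computed from a confusion matrix. It uses the standard textbook definitions, which may differ in detail from the ones used to produce the values above. The function names `confusion_matrix` and `segmentation_metrics` and the toy labels are illustrative assumptions, not part of PipeTransUNet.

```python
import numpy as np


def confusion_matrix(y_true, y_pred, num_classes):
    """Rows index the ground-truth class, columns the predicted class."""
    y_true = np.asarray(y_true).ravel()
    y_pred = np.asarray(y_pred).ravel()
    # Encode each (true, pred) pair as a single index and count occurrences.
    idx = y_true * num_classes + y_pred
    counts = np.bincount(idx, minlength=num_classes ** 2)
    return counts.reshape(num_classes, num_classes)


def segmentation_metrics(cm):
    """Standard definitions (assumed here, not taken from the paper)."""
    cm = cm.astype(np.float64)
    total = cm.sum()
    tp = np.diag(cm)            # correctly labelled pixels per class
    gt = cm.sum(axis=1)         # ground-truth pixels per class
    pred = cm.sum(axis=0)       # predicted pixels per class
    union = gt + pred - tp

    # Classes with no pixels yield NaN and are skipped by nanmean.
    with np.errstate(divide="ignore", invalid="ignore"):
        iou = tp / union
        class_acc = tp / gt

    pa = tp.sum() / total                       # overall pixel accuracy
    present = gt > 0
    fwiou = ((gt[present] / total) * iou[present]).sum()
    pe = (gt * pred).sum() / total ** 2         # chance agreement
    kappa = (pa - pe) / (1.0 - pe)              # undefined if pe == 1

    return {
        "PA": pa,
        "mPA": np.nanmean(class_acc),
        "mIoU": np.nanmean(iou),
        "FWIoU": fwiou,
        "Kappa": kappa,
    }


# Toy example with three classes (hypothetical labels, not real data).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
cm = confusion_matrix(y_true, y_pred, num_classes=3)
metrics = segmentation_metrics(cm)
```

For this toy example, mIoU, FWIoU, and Kappa each come out to approximately 0.5, while PA and mPA are approximately 0.667. For real images, the confusion matrix would be accumulated over all pixels of the test set before the metrics are computed.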