Deep learning-driven transmission line inspection is a critical area of smart power grid development. Despite advances in deep learning for insulator defect detection, challenges remain in model robustness and adaptability to varying real-world conditions, especially for inconspicuous defects in complex backgrounds. This study presents a comprehensive improvement strategy for detecting insulators and cross-scale breakage defects on transmission lines using a proposal-based detection model. The model incorporates a pipeline of improvements: backbone modification, anchor box scale recalibration, better-aligned Region of Interest (RoI) downsampling, and a revised Intersection over Union (IoU) loss function. Several backbone networks, including convolutional network (ConvNet) and Vision Transformer (ViT) structures, are constructed and integrated with attention modules designed to strengthen the perception of insulators and defective regions. The geometric scales of the anchor boxes are re-derived with a purpose-built clustering algorithm that accounts for the elongated shape of insulator strings, improving anchor adaptability. Bilinear interpolation is used to mitigate spatial misalignment during the downsampling of Region Proposal Network (RPN)-based proposals. The experimental results indicate that the improved model with the Swin Transformer (Swin-T) backbone achieves a mean Average Precision (mAP)@0.5 of 88.42% and an mAP@0.7 of 60.52%, with a defect recall rate of 81.94%. In addition, the revised IoU loss function benefits model performance at higher IoU thresholds. These results contribute to the further development of defect detection frameworks for power vision applications.
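The role of bilinear interpolation in RoI downsampling can be illustrated with a minimal sketch. It is not the implementation used in this study: the function names `bilinear_sample` and `roi_pool_aligned`, the single sample per output bin, and the convention that integer coordinates index pixel centers are all assumptions made for illustration. The point it shows is that the proposal's fractional coordinates are sampled directly instead of being rounded to the feature grid, which is where misalignment would otherwise arise.

```python
import math


def bilinear_sample(feature, y, x):
    """Interpolate a 2D feature map (list of rows) at fractional (y, x).

    Assumption: integer coordinates refer to pixel centers.
    """
    h, w = len(feature), len(feature[0])
    # Clamp so that samples on or beyond the border stay defined.
    y = min(max(y, 0.0), h - 1)
    x = min(max(x, 0.0), w - 1)
    y0, x0 = int(math.floor(y)), int(math.floor(x))
    y1, x1 = min(y0 + 1, h - 1), min(x0 + 1, w - 1)
    dy, dx = y - y0, x - x0
    # Blend horizontally on the two neighboring rows, then vertically.
    top = feature[y0][x0] * (1 - dx) + feature[y0][x1] * dx
    bottom = feature[y1][x0] * (1 - dx) + feature[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def roi_pool_aligned(feature, roi, out_h, out_w):
    """Downsample a RoI to out_h x out_w without rounding its coordinates.

    roi is (y_min, x_min, y_max, x_max) in feature-map units.
    Hypothetical simplification: one sample at the center of each bin.
    """
    y_min, x_min, y_max, x_max = roi
    bin_h = (y_max - y_min) / out_h
    bin_w = (x_max - x_min) / out_w
    return [
        [
            bilinear_sample(
                feature,
                y_min + (i + 0.5) * bin_h,
                x_min + (j + 0.5) * bin_w,
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


if __name__ == "__main__":
    # Toy 4x4 feature map whose value at (i, j) is 4*i + j.
    fmap = [[4 * i + j for j in range(4)] for i in range(4)]
    # A proposal with fractional borders, pooled to a 2x2 grid.
    print(roi_pool_aligned(fmap, (0.5, 0.5, 2.5, 2.5), 2, 2))
```

Practical RoI alignment layers typically average several samples per bin and operate on multi-channel tensors; the single-sample version above only isolates the interpolation step.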