Monitoring the condition of railway track lines is an important task for ensuring the safety of the railway transportation system, and the defect recognition result, that is, the inspection report, is the main basis for maintenance decisions. Most previous work has proposed intelligent detection methods for rapid and accurate inspection of the safety state of the track line, whereas few studies have addressed the automatic generation of inspection reports. Inspired by recent advances in dense captioning, such techniques are investigated here to generate textual descriptions of the type, position, status, and interrelationships of key components from field images. To this end, building on DenseCap, a railway track line image captioning model (RTLCap for short) is proposed, which replaces VGG16 with ResNet-50-FPN as the backbone to extract more powerful image features. In addition, to address object occlusion and category imbalance in the field images, Soft-NMS and Focal Loss are applied in RTLCap to improve defect description performance. Then, to increase the image processing speed of RTLCap and reduce model complexity, a reconstructed model named Faster RTLCap is presented, built on YOLOv3. In the encoder, a multi-level regional feature localization, mapping, and fusion module (MFLMF) is proposed to extract regional features, and a Spatial Pyramid Pooling (SPP) layer is placed after MFLMF to reduce the number of model parameters. In the decoder, a stacked LSTM is adopted as the language model for better language representation learning. Both quantitative and qualitative experimental results demonstrate the effectiveness of the proposed methods.
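For readers unfamiliar with the two components used to handle occlusion and category imbalance, the following is a minimal sketch of binary Focal Loss and Gaussian Soft-NMS in plain Python. It is an illustration under stated assumptions, not the RTLCap implementation: the function names, the box format `(x1, y1, x2, y2)`, and the values `alpha=0.25`, `gamma=2.0`, `sigma=0.5`, and `score_thresh=0.001` are illustrative defaults, not settings reported in this work.

```python
import math


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss for a single prediction.

    p is the predicted probability of the positive class and y is 0 or 1.
    alpha and gamma are assumed defaults, not values from this paper.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    # The (1 - p_t)^gamma factor down-weights well-classified examples.
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, 1e-12))


def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay overlapping scores instead of deleting boxes.

    Returns a list of (box, score) pairs in the order they were selected.
    sigma and score_thresh are assumed defaults, not values from this paper.
    """
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Select the highest-scoring remaining box.
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        box, score = candidates.pop(best)
        kept.append((box, score))
        # Decay the scores of the remaining boxes according to their overlap.
        rescored = []
        for other_box, other_score in candidates:
            overlap = iou(box, other_box)
            decayed = other_score * math.exp(-(overlap ** 2) / sigma)
            if decayed >= score_thresh:
                rescored.append((other_box, decayed))
        candidates = rescored
    return kept
```

In this sketch a heavily overlapped box, such as a partially occluded component, keeps a reduced score rather than being discarded outright, and confidently classified examples contribute far less to the loss than hard ones, which is the general motivation for both techniques.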