Bridge condition rating is a challenging task: it depends heavily on the experience level of the inspector performing the manual inspection and is therefore prone to human error. The inspection report typically consists of a collection of images and sequences of sentences (text) describing the condition of the bridge under consideration. In a routine manual bridge inspection, an inspector collects a set of images and textual descriptions of bridge components and assigns an overall condition rating (ranging from 0 to 9) based on the collected information. Unfortunately, this method of bridge inspection has been shown to yield inconsistent condition ratings that correlate with inspector experience. To improve the consistency of image-text inspection data and to predict the corresponding condition ratings, this study first provides a collective image-text dataset extracted from bridge inspection reports of the Virginia Department of Transportation. Using this dataset, we develop novel deep learning-based methods for automatic bridge condition rating prediction based on fusion of the textual and visual data from the collected reports. Our proposed multi-modal deep fusion approach constructs visual and textual representations for images and sentences separately using appropriate encoding functions, and then fuses these representations to enhance the multi-modal prediction performance for the assigned condition ratings. Moreover, we study interpretations of the deployed deep models using saliency maps to identify the parts of the image-text inputs that are essential to condition rating predictions. The findings of this study point to potential improvements from more consistent image-text inspection data collection, as well as from leveraging the proposed deep fusion model to improve bridge condition rating prediction from both visual and textual reports.
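As context for the fusion architecture described above, the following is a minimal sketch, not the authors' actual model: it assumes a hypothetical `DeepFusionRater` class in which a small CNN and a GRU stand in for the unspecified image and text encoders, the two feature vectors are fused by concatenation, and a classifier outputs logits over the 0-9 condition-rating scale.

```python
import torch
import torch.nn as nn

class DeepFusionRater(nn.Module):
    """Hypothetical sketch of a multi-modal deep fusion model:
    separate image and text encoders, feature-level fusion,
    and a classifier over the 0-9 condition-rating scale."""

    def __init__(self, text_vocab_size=10000, embed_dim=128,
                 hidden_dim=256, num_ratings=10):
        super().__init__()
        # Image encoder: a small CNN standing in for any visual backbone.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, hidden_dim),
        )
        # Text encoder: embedding + GRU standing in for any sentence encoder.
        self.embedding = nn.Embedding(text_vocab_size, embed_dim, padding_idx=0)
        self.text_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Fusion by concatenation, followed by the rating classifier.
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_ratings),
        )

    def forward(self, images, token_ids):
        img_feat = self.image_encoder(images)            # (B, hidden_dim)
        _, txt_hidden = self.text_encoder(self.embedding(token_ids))
        txt_feat = txt_hidden.squeeze(0)                 # (B, hidden_dim)
        fused = torch.cat([img_feat, txt_feat], dim=1)   # feature-level fusion
        return self.classifier(fused)                    # logits over ratings 0-9


# Usage example with random inputs (a batch of 4 image-text pairs).
model = DeepFusionRater()
images = torch.randn(4, 3, 224, 224)
token_ids = torch.randint(1, 10000, (4, 50))
logits = model(images, token_ids)
print(logits.shape)  # torch.Size([4, 10])
```

Concatenation is only one possible fusion choice; the abstract does not specify the fusion operator, and attention-based or gated fusion could be substituted in the same place.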