The fit information in multi-modal user-generated content has unique value but remains understudied. This study investigates how the fit between photos and text within a hotel review influences tourists’ rating behavior. Using online hotel review data from Qunar.com, a combination of deep learning techniques and econometric analysis shows that photo-text fit has an inverted U-shaped effect on tourists’ ratings. Rating dispersion and the number of reviews weaken this effect, which is attributed to the varying levels of cognitive resources that tourists possess. Endogeneity bias arising from text and photos is ruled out through a two-stage residual analysis. Information uncertainty reduction is tested as the mechanism linking photo-text fit to ratings. The paper provides insights for practitioners on managing user-generated content in the hospitality industry.