In the emerging business model of short video commerce, consumers can engage with short sales videos by liking, commenting, and sharing. Compared with traditional shopping contexts, consumer engagement behavior in short video commerce is shaped quickly and simultaneously by textual, audio, and visual information, yet the extant literature has not sufficiently explored the effects of such multimodal information. Drawing on multimodal theory and signaling theory, this study investigates how multimodal information in short sales videos, including textual sentiment score, audio spectrum, and visual effects, influences consumer engagement behavior. We collect 4292 real short sales videos from the short video platform Douyin and analyze them with a multi-method approach combining multiple regression analysis (MRA) and fuzzy set qualitative comparative analysis (fsQCA). The results show that the multimodal information features of short sales videos promote consumer engagement behavior, and that the cross-modal configurational solutions for enhancing engagement are not unique. This study bridges a gap in the existing literature by capturing the impact of multimodal information features on consumer engagement behaviors and their asymmetric effects, and it offers practical implications for sellers and marketers seeking to attract consumers to engage in short video commerce.