In this study, we examine the variables that affect follower attraction and determine the best-performing model for predicting it. Drawing on the persuasion attempt perspective of the persuasion knowledge model, we identify 12 variables and conduct two progressive studies. Study 1 is knowledge driven and uses coarse-grained variables and unimodal models without data fusion. Study 2 is knowledge-and-data driven and uses fine-grained features and multimodal models with data fusion. In study 1, we compute the 12 hypothesized variables from nonfused trimodal data (verbal, vocal, and visual). XGBoost and Shapley additive explanations (SHAP) are used to evaluate the importance of these 12 variables, and economic models are adopted to test their statistical significance. In study 2, we refine the 12 coarse-grained variables into 323 fine-grained features. We then fuse the trimodal data to capture intra- and intermodal interactions. Long short-term memory (LSTM) models are applied to learn a mapping from the trimodal input sequences to follower increments. The results show that early fusion LSTM with fused trimodal features achieves a prediction accuracy of 66.3%, outperforming both LSTM with unimodal features and late fusion LSTM with nonfused trimodal features. To our knowledge, these results are among the first to uncover follower attraction strategies from unstructured video data, and this is the first study to provide knowledge-and-data-driven multimodal machine learning models for predicting follower attraction at different granularity levels.
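
To make the distinction between early and late fusion concrete, the following is a minimal sketch in plain Python rather than the implementation used in this study. It assumes that the three modalities are already aligned to the same time steps. The feature dimensions, the toy values, and the helper names `early_fusion`, `late_fusion`, and `mean_model` are hypothetical, and `mean_model` is only a stand-in for a trained LSTM.

```python
from typing import Callable, Dict, List

Sequence = List[List[float]]  # shape: time steps x features
MODALITIES = ("verbal", "vocal", "visual")


def early_fusion(data: Dict[str, Sequence]) -> Sequence:
    """Concatenate the feature vectors of all modalities at each time step.

    A single sequence model trained on the result can learn intermodal
    interactions, because it sees all modalities jointly.
    """
    lengths = {len(data[name]) for name in MODALITIES}
    if len(lengths) != 1:
        raise ValueError("modalities must share the same number of time steps")
    n_steps = lengths.pop()
    return [
        [value for name in MODALITIES for value in data[name][t]]
        for t in range(n_steps)
    ]


def late_fusion(
    data: Dict[str, Sequence],
    models: Dict[str, Callable[[Sequence], float]],
) -> float:
    """Average the predictions of one model per modality.

    Each model sees only its own modality, so interactions between
    modalities are not modeled before the predictions are combined.
    """
    predictions = [models[name](data[name]) for name in MODALITIES]
    return sum(predictions) / len(predictions)


def mean_model(sequence: Sequence) -> float:
    """Hypothetical stand-in for a trained LSTM: mean of all inputs."""
    values = [value for step in sequence for value in step]
    return sum(values) / len(values)


# Toy data (assumed): 2 time steps with 2 verbal, 1 vocal, 3 visual features.
toy = {
    "verbal": [[1.0, 2.0], [3.0, 4.0]],
    "vocal": [[0.5], [0.25]],
    "visual": [[0.0, 1.0, 0.0], [1.0, 0.0, 1.0]],
}

fused = early_fusion(toy)  # 2 time steps x 6 fused features
early_prediction = mean_model(fused)  # one model on the fused sequence
late_prediction = late_fusion(toy, {name: mean_model for name in MODALITIES})
```

In the early fusion path a single model receives one fused sequence, whereas in the late fusion path the modalities meet only when their separate predictions are averaged.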