Pulp grade is an important indicator for performance monitoring in the froth flotation process. Previous studies have shown that pulp grade can be predicted more accurately by using stereo vision information from froth images. However, because froth images captured under identical working conditions show low intra-class variation and bubbles resemble one another in shape and texture, grade prediction remains challenging for current binocular image-based methods. Therefore, to accurately predict the key performance indicators of the flotation process, this paper proposes a prediction model based on binocular image fusion. First, a method for computing a froth image saliency map is proposed, and the saliency map is used as prior knowledge to guide the prediction model to learn the characteristics of the regions of interest in the image. This step addresses the difficulty in grade prediction caused by low intra-class variation in froth images. Then, a multi-scale feature cross-attention fusion network is introduced, in which the multi-scale features of the left and right views serve as gating signals for the attention mechanism to extract common features from the binocular images. Finally, a pulp grade prediction model is developed based on the Video Transformer Network. Results on actual industrial flotation datasets show that, compared with other pulp grade prediction methods, the proposed approach reduces the mean absolute error and root mean square error by 22.84% and 23.55%, respectively.
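For readers less familiar with the two reported metrics, the following minimal Python sketch shows how the mean absolute error (MAE), the root mean square error (RMSE), and a percentage reduction against a baseline can be computed. It assumes the reported reductions are relative to a baseline method's error, which the abstract does not state explicitly. The function names and the grade values are hypothetical and chosen only for illustration; they are not data from the study.

```python
import math


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted grades."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mean_square_error(y_true, y_pred):
    """Square root of the average squared prediction error."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def relative_reduction(baseline_error, proposed_error):
    """Percentage by which the proposed error is lower than the baseline.

    Assumption: "reduces the error by X%" is read as a relative reduction.
    """
    if baseline_error <= 0:
        raise ValueError("baseline error must be positive")
    return 100.0 * (baseline_error - proposed_error) / baseline_error


# Hypothetical pulp grades (%), for illustration only.
measured = [20.0, 22.0, 24.0, 26.0]
baseline_pred = [21.0, 20.0, 25.0, 28.0]   # some reference method
proposed_pred = [20.5, 21.0, 24.5, 27.0]   # some improved method

mae_base = mean_absolute_error(measured, baseline_pred)
mae_new = mean_absolute_error(measured, proposed_pred)
rmse_base = root_mean_square_error(measured, baseline_pred)
rmse_new = root_mean_square_error(measured, proposed_pred)

print(f"MAE reduction:  {relative_reduction(mae_base, mae_new):.2f}%")
print(f"RMSE reduction: {relative_reduction(rmse_base, rmse_new):.2f}%")
```

Because RMSE squares each error before averaging, it penalizes large individual deviations more heavily than MAE, so reporting both gives a fuller picture of prediction quality.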