With the global outbreak of COVID-19, a growing number of countries have made the control of imported cases a priority, imposing restrictive measures to prevent imported cases from spreading the virus. Controlling the imported epidemic requires accurate predictions of the number of imported cases from each source country. This paper proposes PNICA (Prediction on Number of Imported CAses), a novel deep-learning approach to time series prediction of the number of imported COVID-19 cases. First, PNICA adopts a multi-modal learning strategy that fuses three sources of data: flight data, epidemic data, and historical imported-case data. Second, PNICA extends the traditional transformer model with cross-modal attention, which learns the interactions between the data modalities to improve prediction accuracy. We use China as the target country and collect the number of imported cases from four source countries (Japan, the USA, Russia, and the UK), together with the corresponding epidemic and flight data, from May to November 2020. Experiments on the collected data demonstrate that PNICA outperforms the baseline methods in predicting the number of imported cases. An ablation study shows that both the multi-modal learning strategy and cross-modal attention significantly improve prediction performance.
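As a rough illustration of the cross-modal attention idea, the following minimal sketch computes scaled dot-product attention in which the queries come from one modality and the keys and values come from another. It is not the PNICA implementation: the function name `cross_modal_attention`, the toy embeddings, and the omission of learned projection matrices are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(target_seq, source_seq):
    """Scaled dot-product attention across two modalities.

    target_seq: list of query vectors from one modality
                (e.g. hypothetical embeddings of historical imported cases).
    source_seq: list of key/value vectors from another modality
                (e.g. hypothetical embeddings of flight data).
    Assumption: learned projections W_Q, W_K, W_V are omitted (identity),
    so both sequences must share the same embedding dimension.
    """
    d = len(target_seq[0])
    scale = math.sqrt(d)
    outputs, weights = [], []
    for q in target_seq:
        # Similarity of this target time step to every source time step.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale
                  for k in source_seq]
        attn = softmax(scores)
        weights.append(attn)
        # Attention-weighted sum of the source vectors.
        outputs.append([sum(a * v[j] for a, v in zip(attn, source_seq))
                        for j in range(d)])
    return outputs, weights


# Toy, made-up embeddings: 3 time steps of case data, 2 of flight data.
cases = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
flights = [[1.0, 0.0], [0.0, 1.0]]
fused, attn = cross_modal_attention(cases, flights)
```

Each row of `attn` indicates how strongly one time step of the target modality attends to each time step of the source modality; in a full transformer, such a layer would typically be applied per pair of modalities and combined with learned projections and multiple heads.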