COVID-19 emerged at the end of 2019 and has since become a global pandemic. Many methods exist for COVID-19 prediction using a single modality, but none of them achieves 100% accuracy, as individuals exhibit varied symptoms of the disease. To decrease the rate of misdiagnosis, multiple modalities can be combined for prediction. In addition, a self-diagnosis system is needed to reduce the risk of virus spread at testing centres. We therefore propose a robust IoT and deep learning-based multimodal data classification method for accurate COVID-19 prediction. Highly accurate models generally require deep architectures; in this work, we instead introduce two lightweight models: CovParaNet for audio (cough, speech, breathing) classification and CovTinyNet for image (X-ray, CT scan) classification. Comparative analysis against existing benchmark models identified these two as the best unimodal models. Finally, the outputs of the five independently trained unimodal models are integrated by a novel dynamic multimodal Random Forest classifier. The lightweight CovParaNet and CovTinyNet models attain maximum accuracies of 97.45% and 99.19% respectively, even with a small dataset. The proposed dynamic multimodal fusion model predicts the final result with 100% accuracy, precision, and recall, and an online retraining mechanism allows it to operate even in noisy environments. Furthermore, the computational complexity of all the unimodal models is reduced substantially, and the system functions with 100% reliability even when one of the input modalities is absent during testing.
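To illustrate the late-fusion idea described above, the sketch below trains a Random Forest on the per-modality scores of several unimodal models. This is a minimal, hypothetical example with synthetic data: the five score columns (standing in for cough, speech, breathing, X-ray, and CT models), the score distributions, and all parameter choices are assumptions for illustration, not the authors' actual pipeline.

```python
# Hypothetical late-fusion sketch: five unimodal models each emit a
# class-probability score, and a Random Forest fuses them into a final label.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic stand-in data: one row per subject, one score column per modality
# (cough, speech, breathing, X-ray, CT). Real scores would come from the
# trained unimodal models.
n_samples = 200
labels = rng.integers(0, 2, size=n_samples)
# Scores loosely correlated with the true label, plus uniform noise.
scores = labels[:, None] * 0.6 + rng.random((n_samples, 5)) * 0.4

# Fit the fusion classifier on a training split, evaluate on the rest.
fusion = RandomForestClassifier(n_estimators=100, random_state=0)
fusion.fit(scores[:150], labels[:150])
accuracy = (fusion.predict(scores[150:]) == labels[150:]).mean()
```

A missing modality at test time could be handled by imputing a neutral score (e.g. 0.5) for that column before calling `predict`, though the paper's dynamic fusion mechanism may differ.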