Objective. Normal tissue complication probability (NTCP) modelling is rapidly embracing deep learning (DL) methods, reflecting the importance of spatial dose information. Finding effective ways to combine information from radiation dose distribution maps (dosiomics) with clinical data poses technical challenges and requires domain knowledge. We propose several multi-modality data fusion strategies to facilitate future DL-based NTCP studies.

Approach. Early, joint and late DL multi-modality fusion strategies were compared using clinical data and mandibular radiation dose distribution volumes. These were contrasted with two single-modality models: a random forest trained on non-image data (clinical, demographic and dose-volume metrics) and a 3D DenseNet-40 trained on image data (mandibular dose distribution maps). The study involved a matched cohort of 92 osteoradionecrosis cases and 92 controls from a single institution.

Main results. The late fusion model achieved the best discrimination and calibration performance, whereas the joint fusion model produced a more balanced distribution of predicted probabilities. However, discrimination performance did not differ significantly between strategies. Late fusion, although technically simpler, does not capture the inter-modality interactions that are crucial for NTCP modelling. In contrast, joint fusion, despite its greater complexity, uses a single network training process in which both intra- and inter-modality interactions are included in the optimisation of the model parameters.

Significance. This study is a pioneering effort to compare strategies for including image data in DL-based NTCP models alongside lower-dimensional data such as clinical variables. The discrimination performance of such multi-modality NTCP models, and the choice of fusion strategy, will depend on the distribution and quality of both types of data. Multiple data fusion strategies should therefore be compared and reported in multi-modality NTCP modelling using DL.
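To make the difference between the three fusion strategies concrete, the following is a minimal, framework-free sketch of where each strategy combines the two modalities. It is illustrative only and is not the study's implementation. The function names, the use of a logistic output head for joint fusion, and the weighted average of probabilities for late fusion are all assumptions. In a real joint-fusion model the dose embedding would come from a CNN branch (such as the 3D DenseNet-40) trained together with the head; only the forward pass is shown here.

```python
import math
from typing import Sequence


def early_fusion_input(dose_features: Sequence[float],
                       clinical: Sequence[float]) -> list[float]:
    """Early fusion (hypothetical helper): concatenate raw or hand-crafted
    features from both modalities into one input vector *before* any model
    sees them; a single model is then trained on this vector."""
    return [*dose_features, *clinical]


def joint_fusion_head(dose_embedding: Sequence[float],
                      clinical: Sequence[float],
                      weights: Sequence[float],
                      bias: float) -> float:
    """Joint fusion (hypothetical helper): a shared output head sees the
    learned dose embedding and the clinical variables together, so its
    parameters can model inter-modality interactions. During training the
    loss gradient would flow back through this head into the image branch,
    optimising both in one process (not shown; forward pass only)."""
    x = [*dose_embedding, *clinical]
    if len(weights) != len(x):
        raise ValueError("weights must match the fused feature length")
    z = sum(w * v for w, v in zip(weights, x)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # logistic output = predicted NTCP


def late_fusion_probability(p_dose: float, p_clinical: float,
                            w_dose: float = 0.5) -> float:
    """Late fusion (hypothetical helper): combine only the *outputs* of two
    independently trained models, here by an assumed weighted average of
    predicted probabilities. No cross-modality term is ever learned."""
    if not 0.0 <= w_dose <= 1.0:
        raise ValueError("w_dose must lie in [0, 1]")
    return w_dose * p_dose + (1.0 - w_dose) * p_clinical


if __name__ == "__main__":
    print(early_fusion_input([0.2, 0.7], [1.0]))
    print(joint_fusion_head([0.2, 0.7], [1.0], [0.5, -0.3, 0.8], -0.1))
    print(late_fusion_probability(p_dose=0.8, p_clinical=0.4))
```

The sketch shows why late fusion is the simplest to implement (each single-modality model is trained separately and only the final probabilities are merged) and why joint fusion is more complex but can learn interactions between dose and clinical information within one optimisation.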