The accurate classification of bone tumours is crucial for guiding clinical decisions regarding treatment and follow-up. However, differentiating between various tumour types is challenging due to the rarity of certain entities, high intra-class variability, and limited training data in clinical practice. This study proposes a multimodal deep learning model that integrates clinical metadata and X-ray imaging to improve the classification of primary bone tumours. The dataset comprises 1,785 radiographs from 804 patients collected between 2000 and 2020, including metadata such as age, affected bone site, tumour position, and gender. Ten tumour types were selected, with histopathology or tumour board decisions serving as the reference standard.

Methods
Our model is based on the NesT image classification model and a multilayer perceptron with a joint fusion architecture. Descriptive statistics included incidence and percentage ratios for discrete parameters, and mean, standard deviation, median, and interquartile range for continuous parameters.

Results
The mean age of the patients was 33.62 ± 18.60 years, with 54.73% being male. Our multimodal deep learning model achieved 69.7% accuracy in classifying primary bone tumours, outperforming the Vision Transformer model by five percentage points. SHAP values indicated that age had the most substantial influence among the considered metadata.

Conclusion
The joint fusion approach developed in this study, integrating clinical metadata and imaging data, outperformed state-of-the-art models in classifying primary bone tumours. The use of SHAP values provided insights into the impact of different metadata on the model's performance, highlighting the significant role of age. This approach has potential implications for improving diagnostic accuracy and understanding the influence of clinical factors in tumour classification.
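
To illustrate the joint fusion idea described in the Methods (an image backbone and a metadata MLP whose embeddings are combined before the classification head), the sketch below shows one possible PyTorch formulation. It is a minimal, illustrative example only: the placeholder image encoder stands in for the NesT backbone, and the layer sizes, metadata encoding, and module names are assumptions, not the authors' implementation.

```python
# Illustrative sketch of a joint-fusion classifier for radiographs plus
# clinical metadata (age, sex, bone site, tumour position). The image
# encoder here is a dummy stand-in for a NesT backbone; all dimensions
# and names are hypothetical.
import torch
import torch.nn as nn

class JointFusionClassifier(nn.Module):
    def __init__(self, image_encoder: nn.Module, image_dim: int,
                 meta_dim: int = 4, n_classes: int = 10):
        super().__init__()
        self.image_encoder = image_encoder            # backbone returning a feature vector per image
        self.meta_mlp = nn.Sequential(                # multilayer perceptron over tabular metadata
            nn.Linear(meta_dim, 64), nn.ReLU(),
            nn.Linear(64, 64), nn.ReLU(),
        )
        self.head = nn.Sequential(                    # joint fusion: classify the concatenated embeddings
            nn.Linear(image_dim + 64, 128), nn.ReLU(),
            nn.Linear(128, n_classes),
        )

    def forward(self, image: torch.Tensor, metadata: torch.Tensor) -> torch.Tensor:
        img_feat = self.image_encoder(image)          # (B, image_dim)
        meta_feat = self.meta_mlp(metadata)           # (B, 64)
        fused = torch.cat([img_feat, meta_feat], dim=1)
        return self.head(fused)                       # (B, n_classes) logits

if __name__ == "__main__":
    # Dummy encoder standing in for the NesT backbone, for a runnable demo.
    dummy_encoder = nn.Sequential(nn.Flatten(), nn.LazyLinear(256))
    model = JointFusionClassifier(dummy_encoder, image_dim=256)
    x = torch.randn(2, 1, 224, 224)                   # two grayscale radiographs
    m = torch.randn(2, 4)                             # encoded age, sex, site, position
    print(model(x, m).shape)                          # torch.Size([2, 10])
```

Because image and metadata branches are trained end to end in such a joint fusion, feature-attribution methods such as SHAP can then be applied to the metadata inputs to estimate each variable's contribution, which is how age emerged as the most influential metadata feature in the reported results.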