RGB-D food nutrition assessment entails directly predicting the nutritional content of food from pairs of RGB and depth images using signal processing techniques. However, existing methods face challenges in both accuracy and computational complexity. In this paper, we introduce the Ingredient-guided Multi-modal Interaction and Refinement Network (IMIR-Net), a novel framework for RGB-D food nutrition assessment. We first employ a visual grounding model to accurately localize the dish region containing food in an input image with complex semantics. The identified RGB-D dish regions are then processed by a multi-modal interaction and refinement module to compute a fused embedding. Concurrently, we introduce an ingredient guidance module that effectively captures ingredient textual information, yielding an enhanced fused embedding. Finally, we design a decoupled nutrition predictor to predict nutrition values from the fused RGB-D embedding. IMIR-Net captures rich multi-modal semantics for RGB-D food nutrition assessment and achieves state-of-the-art performance on the Nutrition5k dataset. Specifically, the percentage of mean absolute error (PMAE) for Calories, Mass, Fat, Carb, and Protein reaches 14.5%, 10.4%, 21.8%, 20.4%, and 20.0%, respectively. Compared with the best existing method, IMIR-Net achieves superior prediction accuracy, with a mean PMAE of 17.4%, an improvement of 1.1%. In terms of computational complexity, the floating-point operations (FLOPs) and parameter count of our method are 63.87 and 105.11, reductions of 5.2% and 7.4%, respectively. The code, data, and models of our method are publicly available at https://github.com/nianfd/IMIR-Net-DSP.
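
For reference, the PMAE values reported above are the mean absolute error of each nutrient normalized by its mean ground-truth value. A minimal sketch of this computation follows; the function name and sample values are illustrative only, not taken from the paper or its released code:

```python
import numpy as np

def pmae(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Percentage of mean absolute error (PMAE): the mean absolute
    prediction error divided by the mean ground-truth value, in percent."""
    mae = np.mean(np.abs(y_pred - y_true))
    return 100.0 * mae / np.mean(y_true)

# Hypothetical per-dish calorie annotations and predictions.
calories_true = np.array([255.0, 310.0, 180.0, 420.0])
calories_pred = np.array([240.0, 350.0, 200.0, 390.0])
print(f"Calories PMAE: {pmae(calories_true, calories_pred):.1f}%")
```

Under this definition, a lower PMAE indicates better accuracy, and the mean PMAE reported above is the average of the per-nutrient PMAE values across Calories, Mass, Fat, Carb, and Protein.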