Multi-modal services, which integrate audio, video, and haptic streams, substantially improve users' immersive experience and are gradually becoming mainstream applications. However, existing solutions fail to simultaneously satisfy the diverse transmission requirements of multi-modal services, namely low latency, high reliability, and high throughput, especially in resource-constrained and dynamic environments. To address this problem, this work constructs a cross-modal stream transmission architecture that exploits intra-modal redundancy and inter-modal correlation. Specifically, we propose an adaptive stream scheduling strategy that not only substantially decreases stream traffic, but also reduces the impact of channel uncertainty and the stochastic arrival of haptic signals on delay jitter and reliability. Moreover, by prioritizing resource provisioning for haptic signals, we design a joint uplink and downlink resource allocation scheme based on a prediction model. In particular, the transmitter sends the predicted haptic signals in advance via the allocated resource blocks, according to the optimal window size, so as to offset the latency and maintain symmetry during bilateral haptic transmission. Meanwhile, we develop a hierarchical, resource-reusing audio and video stream transmission scheme by integrating power-domain NOMA with LMDC-FEC, which fully utilizes the remaining limited resources to maximize the robustness of audio-video transmission.
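As a rough illustration of the power-domain NOMA principle mentioned above, the following minimal sketch computes the textbook achievable rates of two superposed signals that share one resource block, assuming perfect successive interference cancellation (SIC). It is not the transmission scheme of this work: the function name `noma_two_user_rates`, its parameters, and the two-receiver setup are hypothetical, and LMDC-FEC is not modeled.

```python
import math


def noma_two_user_rates(p_total, alpha_weak, gain_weak, gain_strong, noise):
    """Achievable rates (bit/s/Hz) for two-signal downlink power-domain NOMA.

    Hypothetical illustration with perfect SIC assumed.
    p_total     : total transmit power on the shared resource block
    alpha_weak  : power fraction given to the weak-channel signal (0.5 < alpha < 1)
    gain_weak   : channel power gain of the weak receiver
    gain_strong : channel power gain of the strong receiver
    noise       : noise power at each receiver
    """
    if not 0.5 < alpha_weak < 1.0:
        raise ValueError("alpha_weak must lie in (0.5, 1) so the weak signal dominates")
    if gain_strong < gain_weak:
        raise ValueError("gain_strong must be at least gain_weak")

    p_weak = alpha_weak * p_total
    p_strong = (1.0 - alpha_weak) * p_total

    # The weak receiver decodes its own high-power signal directly and
    # treats the superposed low-power signal as interference.
    rate_weak = math.log2(1.0 + p_weak * gain_weak / (p_strong * gain_weak + noise))

    # The strong receiver first decodes and subtracts the high-power signal
    # (SIC), so its own signal is limited by noise only.
    rate_strong = math.log2(1.0 + p_strong * gain_strong / noise)

    return rate_weak, rate_strong


if __name__ == "__main__":
    r_weak, r_strong = noma_two_user_rates(
        p_total=1.0, alpha_weak=0.8, gain_weak=1.0, gain_strong=10.0, noise=0.1
    )
    print(f"weak-signal rate:   {r_weak:.3f} bit/s/Hz")
    print(f"strong-signal rate: {r_strong:.3f} bit/s/Hz")
```

Giving the larger power share to the signal received over the weaker channel is what allows both signals to reuse the same resource block; how the audio and video layers are actually mapped onto the superposed signals is specified by the proposed scheme, not by this sketch.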