Abstract

Multi-modal services, typically integrating signals such as audio, video, and haptics, will become an inevitable application trend in 5G and beyond. However, due to the essential differences between haptic and audio/video signals, existing coding schemes usually fail to satisfy the critical requirements in terms of rate-distortion performance. Inspired by the observation that hearing, sight, and touch are highly correlated, we propose the framework of cross-modal coding, which compresses multi-modal signals aided by their semantic correlation. In particular, the highlights of this work lie in addressing three fundamental technical problems: i) how to exploit the semantic correlation among different modalities, ii) how much benefit cross-modal coding can provide, and iii) how to design a general cross-modal codec. On the theoretical end, we determine the minimum number of bits required to compress haptic signals under given rate conditions of the video stream by investigating their semantic correlation. On the technical end, we design a general cross-modal codec that approaches the optimal compression limit by using AI-enabled cross-modal prediction and channel coding. Numerical results demonstrate that the proposed cross-modal coding achieves significant gains over existing schemes, especially when the multi-modal signals have strong semantic correlation.
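To make the core idea concrete, the following is a minimal, illustrative sketch (not the authors' codec): a toy cross-modal scheme in which a predictor estimates the haptic signal from a correlated video feature and only the quantized prediction residual is entropy-coded. The synthetic signal model, the linear predictor, and all function names are assumptions made for this example; it is intended only to show why the required bit rate drops as the cross-modal correlation grows.

```python
# Toy cross-modal residual coding: the stronger the correlation between the
# video side information and the haptic signal, the fewer bits the residual
# needs. Purely illustrative; not the codec described in the paper.
import numpy as np

def empirical_entropy_bits(symbols):
    """Shannon entropy (bits/symbol) of an integer symbol stream."""
    _, counts = np.unique(symbols, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def simulate(correlation, n=100_000, step=0.05, seed=0):
    rng = np.random.default_rng(seed)
    video_feature = rng.standard_normal(n)          # side information at the decoder
    noise = rng.standard_normal(n)
    # Haptic samples correlated with the video feature (assumed signal model).
    haptic = correlation * video_feature + np.sqrt(1 - correlation**2) * noise

    # Intra-modal baseline: quantize and entropy-code the haptic signal directly.
    direct_bits = empirical_entropy_bits(np.round(haptic / step).astype(int))

    # Cross-modal coding: predict haptic from video, code only the residual.
    residual = haptic - correlation * video_feature  # optimal linear predictor here
    cross_bits = empirical_entropy_bits(np.round(residual / step).astype(int))
    return direct_bits, cross_bits

if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9, 0.99):
        direct_bits, cross_bits = simulate(rho)
        print(f"rho={rho:.2f}  direct={direct_bits:.2f} bits/sample  "
              f"cross-modal={cross_bits:.2f} bits/sample")
```

Running the sketch shows the rate of the cross-modal branch shrinking toward zero as the correlation approaches one, while the direct-coding rate stays fixed, which mirrors the intuition behind compressing haptic signals with video side information.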
