Multimodal interaction systems combine visual information (images, text, sketches and so on) with voice, gestures and other modalities to provide flexible and powerful dialogue approaches, letting users choose one or more of the available interaction modalities. They lower the barriers to adopting mobile devices for value-added services, and the integrated use of multiple input modes lets users benefit from the natural approach of human communication. This paper deals with the main features of multimodal interaction and multimodal systems, starting from the definition of visual language given in Bottoni et al. (1995) and extending it to multimodality. It defines the notions of modal and multimodal message, interpretation and materialisation functions, and multimodal sentence. It then introduces and formally defines the classes of cooperation between modes, together with the time relationships among the modalities involved and the relationships between the chunks of information connected with these modalities.