The objective of the work described in this paper is to develop an intelligent generation system that can combine textual and visual material. Coherent presentations cannot be generated by simply merging the results of verbalization and visualization into multimedia output; instead, the processes for content determination, medium selection, and content realization in different media have to be carefully coordinated. We first show that multimedia presentations and pure text follow similar structuring principles. Based on this insight, we sketch how techniques for planning text and discourse can be generalized so that the structure and contents of multimedia communications can be planned as well. In particular, we explain how our approach handles the crucial task of process coordination.