This paper examines the relevance of the first impression of interactive systems, formed from short, passive visual and auditory stimuli of system output. Individual consistency between such impressions and retrospective user ratings, obtained directly after real interaction, is studied in four exploratory experiments. All four systems accept voice input. Two systems are considered multimodal because they support input other than speech (e.g., gesture), whereas the other two, which offer speech as the sole input modality, are multimedia systems. The first impression of the four systems is based on screenshots of typical display views and selected prompts of the systems’ speech output. The measures used were pragmatic quality (i.e., the functional aspects of a system, such as efficiency and effectiveness, that are closely related to the concept of usability) and hedonic qualities (i.e., the system’s non-instrumental aspects, such as its ability to provide stimulation and identification and thus to evoke the psychological well-being of the user). We tested whether the consistency reported for websites can also be found for speech-based systems. In our case, this consistency was assessed within systems rather than between systems. The results indicate that users’ first impressions of system output correlate with ratings collected after the interaction for each of the four systems. For the two truly multimodal systems, ratings after single-modality input (e.g., voice only or touch screen only) also correlate with ratings of multimodal interaction with the same system. This result confirms data from the literature. However, our assumption of lower correlations for the first impression of pragmatic quality, expected due to its experience-based character, is not supported. Instead, pragmatic quality seems to represent a construct with low consistency in general.
One reason might be that the benefit in pragmatic quality experienced during multimodal interaction is neither covered by unimodal interaction nor predictable from a first impression. Additional multiple regression analyses for the two systems with multiple input modalities show that the first impression of the visual system output can complement predictors from the single-modality interactions to model post-usage multimodal ratings. However, which of the output channels has a relevant impact was found to be highly system-dependent.