Abstract In this paper we propose combining speech-based and linguistic classification to improve emotion recognition for spoken user utterances. These approaches are usually considered in isolation and are even developed by different communities working on emotion recognition and sentiment analysis. We propose modeling the user's emotional state by fusing the outputs generated by both approaches, taking into account information that the individual approaches usually neglect, such as the interaction context and errors, and the peculiarities of transcribed spoken utterances. The fusion approach allows different recognizers to be employed and can be integrated as an additional module in the architecture of a spoken conversational agent, where the information it generates serves as an additional input for the dialog manager to decide the next system response. We have evaluated our proposal using three emotionally-colored databases and obtained very positive results.