Emotion cause extraction (ECE) aims to identify the causes of a given emotion and has drawn much attention in natural language and signal processing. Conventional ECE normally focuses on a single level (i.e., either the word level or the clause level) in a document scenario. However, single-level ECE cannot serve as wide a range of applications as extraction at both levels. Moreover, existing dialogue systems increasingly need empathy support. Therefore, in this paper, we propose to hierarchically extract both word-level and utterance-level emotion causes in the spoken dialogue scenario (HECE). We first construct two datasets (i.e., HECE-DD and HECE-IE) based on previous studies, and then propose a hierarchical framework that consists of a feature extractor, an utterance-level cause extractor, and a word-level cause extractor. In this framework, the utterance and word levels can naturally interact. Detailed experiments on the two HECE datasets demonstrate that hierarchical extraction performs better than extraction at a single level.