Given a text-image pair, Multimodal Named Entity Recognition (MNER) is the task of identifying and categorizing entities in the text. Most existing work performs named entity labeling directly on final token representations derived by fusing image and text representations. Although these approaches achieve promising results, they may fail to effectively exploit the text and image modalities because they neglect the different roles of the two: the text modality suffices to detect the boundaries of an entity, while the image modality is introduced to disambiguate the category of the entity. Motivated by this observation, in this paper we construct two auxiliary tasks based on a decomposition strategy and propose a multi-task framework for MNER. Specifically, we first decompose MNER into two auxiliary tasks: an entity boundary detection task and an entity category classification task. The former takes only the text modality as input and outputs boundary labels, since text alone achieves satisfactory boundary results. The latter uses both modalities to yield category labels, with the image modality dedicated to disambiguating categories. These two auxiliary tasks allow the two modalities to be exploited effectively, each in its proper role. We then vectorize their results so that label clues from the auxiliary tasks can improve entity recognition. Finally, we fuse the features from the text and image modalities with the label embeddings from the auxiliary tasks to fulfill MNER. Experimental results on two widely used MNER datasets show that our framework achieves new state-of-the-art performance.
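To make the decomposition concrete, below is a minimal PyTorch sketch of the described pipeline: a text-only boundary head, a text+image category head, and a final MNER tagger that fuses both modalities with label embeddings of the auxiliary predictions. All module names, dimensions, and label-set sizes are illustrative assumptions, not the authors' implementation; encoder features are assumed to be precomputed (e.g., token features from a text encoder and a pooled image feature).

```python
# Hypothetical sketch of the multi-task decomposition; not the paper's code.
import torch
import torch.nn as nn


class DecomposedMNER(nn.Module):
    def __init__(self, d_text=768, d_img=2048, d_model=256,
                 n_boundary=3,   # e.g., B / I / O boundary tags (assumed)
                 n_category=5,   # e.g., PER / LOC / ORG / MISC / O (assumed)
                 n_mner=9):      # final BIO x category tag set (assumed)
        super().__init__()
        self.text_proj = nn.Linear(d_text, d_model)
        self.img_proj = nn.Linear(d_img, d_model)
        # Auxiliary task 1: entity boundary detection from text alone.
        self.boundary_head = nn.Linear(d_model, n_boundary)
        # Auxiliary task 2: entity category classification from text + image.
        self.category_head = nn.Linear(2 * d_model, n_category)
        # Label embeddings vectorize the auxiliary results into label clues.
        self.boundary_emb = nn.Embedding(n_boundary, d_model)
        self.category_emb = nn.Embedding(n_category, d_model)
        # Final tagger fuses text, image, and both label-clue embeddings.
        self.mner_head = nn.Linear(4 * d_model, n_mner)

    def forward(self, text_feats, img_feat):
        # text_feats: (batch, seq_len, d_text); img_feat: (batch, d_img)
        t = self.text_proj(text_feats)                    # (B, L, d_model)
        v = self.img_proj(img_feat).unsqueeze(1)          # (B, 1, d_model)
        v = v.expand(-1, t.size(1), -1)                   # broadcast to tokens
        boundary_logits = self.boundary_head(t)           # text-only boundaries
        category_logits = self.category_head(torch.cat([t, v], dim=-1))
        # Hard argmax is used here for simplicity; a soft, probability-
        # weighted embedding would keep the auxiliary heads differentiable.
        b_clue = self.boundary_emb(boundary_logits.argmax(-1))
        c_clue = self.category_emb(category_logits.argmax(-1))
        fused = torch.cat([t, v, b_clue, c_clue], dim=-1)
        return boundary_logits, category_logits, self.mner_head(fused)


# Toy usage with random tensors standing in for encoder outputs.
model = DecomposedMNER()
b, c, y = model(torch.randn(2, 16, 768), torch.randn(2, 2048))
print(b.shape, c.shape, y.shape)  # (2,16,3) (2,16,5) (2,16,9)
```

In this sketch the boundary head never sees the image feature, mirroring the claim that text alone suffices for boundary detection, while the category head and the final tagger receive the fused representation.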