Micro-videos have become a prominent form of user-generated content across various social media platforms. Accurate event analysis of micro-videos can greatly enhance many applications on these platforms. Although some studies have shown promising results from a multimodal perspective, extracting informative cues from unreliable modalities remains challenging, particularly the text modality, which is prone to inaccuracies and noise. In this paper, we propose a multimodal semantically enhanced representation network (MSERN) for micro-video event detection. To better handle inaccurate and noisy text sentences, we first extract visual concepts in the form of adjective-noun pairs (ANPs) through a fine-grained common representation module to complement the textual descriptions. To maximize the acquisition of modality-specific cues from both the visual and textual modalities, we then design a coarse-grained private representation module that ensures the private representations capture unique facets of each modality beyond the common perspective. Finally, to let the two modules collaborate, the fine-grained common and coarse-grained private representations are integrated to produce a reinforced micro-video representation. We evaluate the proposed method on a micro-video event detection dataset, and the experimental results demonstrate superior performance over state-of-the-art methods.