Motion deblurring can be advanced by exploiting informative features from supplementary sensors such as event cameras, which can capture rich motion information asynchronously with high temporal resolution. Existing event-based motion deblurring methods neither consider the modality redundancy in spatial fusion nor temporal cooperation between events and frames. To tackle these limitations, a novel spatial-temporal collaboration network (STCNet) is proposed for event-based motion deblurring. Firstly, we propose a differential-modality based cross-modal calibration strategy to suppress redundancy for complementarity enhancement, and then bimodal spatial fusion is achieved with an elaborate cross-modal co-attention mechanism to weight the contributions of them for importance balance. Besides, we present a frame-event mutual spatio-temporal attention scheme to alleviate the errors of relying only on frames to compute cross-temporal similarities when the motion blur is significant, and then the spatio-temporal features from both frames and events are aggregated with the custom cross-temporal coordinate attention. Extensive experiments on both synthetic and real-world datasets demonstrate that our method achieves state-of-the-art performance. Project website: https://github.com/wyang-vis/STCNet.