This paper proposes an end-to-end multi-focus image fusion network that addresses misregistration among the captured images. It accounts for the shaking and camera breathing effects that commonly arise when multiple partially focused images are acquired. The network consists of four modules: a homography pre-correction module, a deformable convolution fine-correction module, an attention-based fusion module and a reconstruction module. The homography pre-correction module performs a coarse alignment of the multi-focus images. The deformable convolution fine-correction module then refines this alignment, compensating for the small misalignments caused by shaking. The attention-based fusion module and the reconstruction module together produce high-definition fusion results. A dedicated shaking dataset was built for training and testing. Extensive experiments on the test datasets, real scenes and image sequences show the algorithm's superior performance in depth-of-field extension and its generalisation to different scenes. Ablation experiments further demonstrate the necessity of each module in the algorithm.
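As a rough illustration of two ideas summarised above, the sketch below shows how a homography maps a pixel coordinate (the kind of global transform a coarse correction relies on) and how softmax weights can blend aligned source pixels (a simple stand-in for attention-based fusion). It is not the network described in this paper: the modules there are learned, whereas the function names `apply_homography`, `softmax_weights` and `fuse_pixel`, the hand-written translation matrix and the focus scores are all hypothetical and chosen only for illustration. Plain Python lists are used to keep the example free of dependencies.

```python
import math

def apply_homography(H, point):
    """Map an (x, y) point through a 3x3 homography H given as nested lists."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)  # divide by w to return to image coordinates

def softmax_weights(scores):
    """Turn per-image scores at one pixel into weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_pixel(values, scores):
    """Weighted average of one pixel across aligned source images."""
    weights = softmax_weights(scores)
    return sum(w * v for w, v in zip(weights, values))

# Hypothetical coarse correction: a pure translation by (2, -1).
H_shift = [[1.0, 0.0, 2.0],
           [0.0, 1.0, -1.0],
           [0.0, 0.0, 1.0]]
corrected = apply_homography(H_shift, (5.0, 5.0))  # (7.0, 4.0)

# Two aligned sources at one pixel; the second has a much higher (assumed) focus score,
# so the fused value lands close to the second source's value.
fused = fuse_pixel(values=[0.2, 0.8], scores=[0.0, 10.0])
```

In this toy setting the scores are supplied by hand; in a learned fusion module they would be predicted from image features, and the fine alignment step would additionally handle local, non-global displacements that a single homography cannot express.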