The conventional phase retrieval wavefront sensing approaches mainly refer to a series of iterative algorithms, such as G-S algorithms, Y-G algorithms and error reduction algorithms. These methods use intensity information to calculate the wavefront phase. However, most of the traditional phase retrieval algorithms are difficult to meet the real-time requirements and depend on the iteration initial value used in iterative transformation or iterative optimization to some extent, so their practicalities are limited. To solve these problems, in this paper, a phase-diversity phase retrieval wavefront sensing method based on wavelet transform image fusion and convolutional neural network is proposed. Specifically, the image fusion method based on wavelet transform is used to fuse the point spread functions at the in-focus and defocus image planes, thereby simplifying the network inputs without losing the image information. The convolutional neural network (CNN) can directly extract image features and fit the required nonlinear mapping. In this paper, the CNN is utilized to establish the nonlinear mapping between the fusion images and wavefront distortions (represented by Zernike polynomials), that is, the fusion images are taken as the input data, and the corresponding Zernike coefficients as the output data. The network structure of the training in this paper has 22 layers, they are 1 input layer, 13 convolution layers, 6 pooling layers, 1 flatten layer and 1 full connection layer, that is, the output layer. The size of the convolution kernel is 3 × 3 and the step size is 1. The pooling method selects the maximum pooling and the size of the pooling kernel is 2 × 2. The activation function is ReLU, the optimization function is Adam, the loss function is the MSE, and the learning rate is 0.0001. The number of training data is 10000, which is divided into three parts: training set, validation set, and test set, accounting for 80%, 15% and 5% respectively. Trained CNN can directly output the Zernike coefficients of order 4–9 to a high precision, with these fusion images serving as the input, which is more in line with the real-time requirements. Abundant simulation experiments prove that the wavefront sensing precision is root-mean-square(RMS) 0.015<i>λ</i>, when the dynamic range of the wavefront is the aberration of low spatial frequency within 1.1<i>λ</i> of RMS value (i.e. the dynamic range of Zernike coefficients of order 4–9 is <inline-formula><tex-math id="M600">\begin{document}$[- 0.5\lambda \,, \, 0.5\lambda]$\end{document}</tex-math><alternatives><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="5-20201362_M600.jpg"/><graphic xmlns:xlink="http://www.w3.org/1999/xlink" xlink:href="5-20201362_M600.png"/></alternatives></inline-formula>). In practical application, according to the system aberration characteristics, the number of network output layer units can be changed and the network structure can be adjusted based on the method presented in this paper, thereby training the new network suitable for higher order aberration to realize high-precision wavefront sensing. It is also proved that the proposed method has certain robustness against noise, and when the relative defocus error is within 7.5%, the wavefront sensor accuracy is acceptable. With the improvement of image resolution, the wavefront sensing accuracy is improved, but the number of input data of the network also increases with the sampling rate increasing, and the time cost of network training increases accordingly.