Voice conversion methods fall into two categories: statistical and deep learning-based. Although statistical methods can be trained with limited data, they face challenges including spectral oversmoothing and time-domain discontinuity. Deep learning-based methods, by contrast, have been researched extensively but rely primarily on massive amounts of data, which limits their practical applicability. Given that voice conversion is an engineering problem in which training data are often scarce, it is crucial to develop techniques that produce satisfactory results in terms of quality and similarity without a large amount of data. This paper proposes a voice conversion model based on stochastic variational deep kernel learning (SVDKL) that works with limited training data. The model combines the deep neural network's expressive capability with the high flexibility of the Gaussian process, a Bayesian and non-parametric method. Using a cascade of a deep neural network and a conventional kernel as the covariance function enables the model to estimate non-smooth and more complex functions. Furthermore, the model's sparse variational Gaussian process solves the scalability problem of exact inference and enables the learning of a global mapping function over the entire acoustic space. One of the most important aspects of the proposed scheme is that the model parameters are trained by marginal likelihood optimization, which accounts for both data fit and model complexity; penalizing complexity increases robustness to overfitting and thus reduces the amount of training data required. To evaluate the proposed scheme, we examined the model's performance with as little as approximately 80 s of training data. The results indicate that our method achieves a higher mean opinion score, lower spectral distortion, and better preference-test results than state-of-the-art limited-data methods.
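To make the central idea concrete, the sketch below is a minimal toy illustration, not the paper's implementation: a neural feature extractor feeds a conventional RBF kernel (a deep kernel), and the kernel hyperparameters are fitted by minimizing the negative log marginal likelihood, whose Cholesky log-determinant term is exactly the complexity penalty the abstract refers to. For brevity it assumes a fixed random feature extractor, an exact (non-sparse) Gaussian process, and synthetic 1-D data with a non-smooth step target; the actual SVDKL model trains the network jointly and uses a sparse variational approximation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)

# Toy 1-D data with a non-smooth (step) target, the kind of mapping a
# plain smooth kernel handles poorly but a deep kernel can capture.
X = np.linspace(-3.0, 3.0, 40)[:, None]
y = np.sign(X[:, 0]) + 0.1 * rng.standard_normal(40)

# Fixed random one-hidden-layer network standing in for the DNN part of
# the deep kernel (the paper trains this jointly; it is frozen here).
W1 = rng.standard_normal((1, 16))
b1 = rng.standard_normal(16)
W2 = rng.standard_normal((16, 2))

def features(x):
    return np.tanh(x @ W1 + b1) @ W2

def rbf(a, b, lengthscale, signal_var):
    # Conventional RBF kernel applied to the network's feature outputs.
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return signal_var * np.exp(-0.5 * d2 / lengthscale)

def neg_log_marginal_likelihood(theta):
    lengthscale, signal_var, noise_var = np.exp(theta)  # positivity via exp
    Z = features(X)
    K = rbf(Z, Z, lengthscale, signal_var) + noise_var * np.eye(len(X))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    # data-fit term + complexity (Occam) penalty + normalizing constant
    return (0.5 * y @ alpha
            + np.log(np.diag(L)).sum()
            + 0.5 * len(X) * np.log(2.0 * np.pi))

# Hyperparameters are trained by marginal likelihood optimization.
res = minimize(neg_log_marginal_likelihood, x0=np.zeros(3), method="L-BFGS-B")
print("optimized negative log marginal likelihood:", res.fun)
```

Because the objective contains both the quadratic data-fit term and the log-determinant complexity term, the optimizer cannot shrink the noise or lengthscale arbitrarily to fit the data; this built-in trade-off is what lends the approach its robustness to overfitting under limited data.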