Many data sources, such as human poses, lie on low-dimensional manifolds that are smooth and bounded. Learning low-dimensional representations for such data is an important problem. One typical solution is to use encoder-decoder networks. However, due to the lack of effective regularization in the latent space, the learned representations usually do not preserve the essential data relations. For example, adjacent frames of a video sequence may be encoded into very different zones of the latent space, with holes in between. This is problematic for tasks such as denoising, because slightly perturbed data risk being encoded into very different latent variables, making the output unpredictable. To resolve this problem, we first propose a neighborhood geometric structure-preserving variational autoencoder (SP-VAE), which not only maximizes the evidence lower bound but also encourages latent variables to preserve the neighborhood structure they have in the ambient space. We then learn a set of small surfaces that approximately bound the learned manifold, in order to handle holes in the latent space. We extensively validate the properties of our approach through reconstruction, denoising, and random image generation experiments on several data sources, including a synthetic Swiss roll, human pose sequences, and facial expression images. The results show that our approach learns smoother manifolds than the baselines. We also apply our approach to human pose refinement and facial expression image interpolation, where it achieves better results than the baselines.
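To make the SP-VAE objective described above concrete, the following is a minimal sketch of how the evidence lower bound could be combined with a neighborhood structure-preserving penalty. It is not the authors' exact formulation: the weight `lam`, the choice of `k` nearest neighbors, the use of Euclidean distances, and the squared-difference penalty are all assumptions made for illustration.

```python
# Illustrative sketch (assumed form, not the paper's exact loss): a VAE loss
# augmented with a term that penalizes mismatch between pairwise distances of
# nearby points in ambient space and the corresponding distances in latent space.
import torch
import torch.nn.functional as F

def sp_vae_loss(x, x_recon, mu, logvar, lam=1.0, k=5):
    """x, x_recon: (B, D) ambient-space batch and its reconstruction;
    mu, logvar: (B, Z) encoder outputs; lam: regularizer weight (assumed);
    k: number of nearest neighbors whose distances are preserved (assumed)."""
    # Standard negative ELBO: reconstruction error + KL divergence to N(0, I).
    recon = F.mse_loss(x_recon, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())

    # Pairwise distances in ambient space and in latent space (using the means).
    d_x = torch.cdist(x, x)      # (B, B)
    d_z = torch.cdist(mu, mu)    # (B, B)

    # For each point, take its k nearest ambient-space neighbors (excluding itself)
    # and penalize the discrepancy between ambient and latent distances there.
    knn_idx = d_x.topk(k + 1, largest=False).indices[:, 1:]   # (B, k)
    d_x_nn = torch.gather(d_x, 1, knn_idx)
    d_z_nn = torch.gather(d_z, 1, knn_idx)
    structure = ((d_x_nn - d_z_nn) ** 2).mean()

    return recon + kl + lam * structure
```

In practice the ambient and latent distances live on different scales, so a real implementation would likely normalize or rescale them before comparison; the sketch omits this for brevity.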