This paper addresses the challenge of novel view synthesis for a human performer from a very sparse set of camera views. Recent works have shown that learning implicit neural representations of 3D scenes achieves remarkable view synthesis quality when dense input views are available. However, learning such a representation becomes ill-posed when the views are highly sparse. To address this, our key idea is to integrate observations across video frames. To this end, we propose Neural Body, a new human body representation which assumes that the neural representations learned at different frames share the same set of latent codes anchored to a deformable mesh, so that observations across frames can be naturally integrated. The deformable mesh also provides geometric guidance that helps the network learn 3D representations more efficiently. Furthermore, we combine Neural Body with implicit surface models to improve the learned geometry. We evaluate our approach on both synthetic and real-world data, and the experiments show that it outperforms prior works by a large margin on novel view synthesis and 3D reconstruction. We also demonstrate that our approach can reconstruct a moving person from a monocular video on the People-Snapshot dataset.
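The core assumption, a single set of latent codes anchored to mesh vertices and reused at every frame, can be illustrated with a toy sketch. The following is not the paper's implementation: the function name `query_latent`, the array sizes, and the inverse-distance interpolation over the nearest vertices are all assumptions made for illustration, since the abstract does not specify how codes are propagated to query points. A rigid translation stands in for the mesh deformation between two frames.

```python
import numpy as np

def query_latent(points, vertices, codes, k=4, eps=1e-8):
    """Interpolate per-vertex latent codes at 3D query points.

    Hypothetical helper: uses inverse-distance weighting over the k
    nearest mesh vertices as a simple stand-in for whatever code
    diffusion a real system would use.
    """
    # Squared distances between every query point and every vertex: (P, N)
    d2 = ((points[:, None, :] - vertices[None, :, :]) ** 2).sum(axis=-1)
    # Indices of the k nearest vertices for each point: (P, k)
    idx = np.argsort(d2, axis=1)[:, :k]
    near = np.sqrt(np.take_along_axis(d2, idx, axis=1))
    # Inverse-distance weights, normalised to sum to 1 per point
    w = 1.0 / (near + eps)
    w /= w.sum(axis=1, keepdims=True)
    # Weighted sum of the selected codes: (P, code_dim)
    return (w[..., None] * codes[idx]).sum(axis=1)

rng = np.random.default_rng(0)
num_vertices, code_dim = 100, 16  # assumed sizes, for illustration only

# One set of latent codes, shared by all frames (would be learned in practice)
codes = rng.normal(size=(num_vertices, code_dim))

# Frame 0 mesh, and a frame 1 mesh with the same topology after a toy
# "deformation" (here just a rigid translation)
verts_frame0 = rng.normal(size=(num_vertices, 3))
offset = np.array([0.5, 0.0, 0.0])
verts_frame1 = verts_frame0 + offset

# Query points that move together with the body
pts_frame0 = rng.normal(size=(5, 3))
pts_frame1 = pts_frame0 + offset

f0 = query_latent(pts_frame0, verts_frame0, codes)
f1 = query_latent(pts_frame1, verts_frame1, codes)
```

Because the codes follow the vertices rather than fixed world coordinates, a point that moves with the body retrieves the same feature in both frames. This is what would let observations from different frames supervise the same underlying codes.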