In neural machine translation (NMT), most sequence-to-sequence (seq2seq) models are trained solely with the teacher-forcing paradigm, in which the ground-truth history is used to predict the next ground-truth word. At inference time, however, the decoder predicts the next token based only on the history it has generated from scratch. Both relying on ground-truth history and being trained to predict only ground-truth words can lead to exposure bias. On the one hand, to alleviate the exposure bias caused by relying on ground-truth history, we propose contextual augmentation, which allows substitution, insertion, and deletion of words in the target sequence so as to generate non-ground-truth yet natural history for predicting the next word. On the other hand, to alleviate the exposure bias caused by predicting only ground-truth words, we further apply self-distillation to guide the model to optimize toward a smoothed prediction distribution, i.e., to predict not only the ground-truth word but also other potentially correct and reasonable words. Experimental results on the WMT14 English <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex-math notation="LaTeX">$\leftrightarrow$</tex-math></inline-formula> German and IWSLT14 German <inline-formula xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xlink="http://www.w3.org/1999/xlink"><tex-math notation="LaTeX">$\rightarrow$</tex-math></inline-formula> English translation tasks demonstrate that our approach achieves significant improvements over the Transformer baseline on standard benchmarks. Detailed experimental analyses further confirm the effectiveness of our approach in improving translation quality.