We explore the feasibility of estimating a speaker's age from ultrasound tongue images. Motivated by the success of deep learning, we train a deep convolutional neural network on the UltraSuite dataset. The model achieves a mean absolute error (MAE) of 2.03 years on data from typically developing children and an MAE of 4.87 years on data from children with speech sound disorders, which suggests that ultrasound-based age estimation is more challenging for children with speech sound disorders. We also investigate what the deep model learns for the age estimation task by visualizing the convolutional layers of the trained network. We observe that the model focuses not only on the tongue contour in the ultrasound image but also on the regions corresponding to the tendon and the tongue root, which may provide guidance for future ultrasound tongue image interpretation tasks. The developed method can be used as a tool to evaluate the performance of speech therapy sessions.
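For readers unfamiliar with the metric, the mean absolute error reported above is the average absolute difference between predicted and true ages. The following is a minimal sketch of that computation only; the function name and the sample ages are hypothetical illustrations and are not taken from the UltraSuite data or from the model described here.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference between true and predicted ages, in years."""
    if len(true_ages) != len(predicted_ages):
        raise ValueError("true_ages and predicted_ages must have the same length")
    if not true_ages:
        raise ValueError("at least one sample is required")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


# Hypothetical ages in years, for illustration only (not results from the paper).
true_ages = [6.0, 8.0, 10.0, 12.0]
predicted_ages = [7.0, 7.0, 13.0, 11.0]

mae = mean_absolute_error(true_ages, predicted_ages)
print(f"MAE: {mae:.2f} years")  # absolute errors are 1, 1, 3, 1, so MAE is 1.50
```

Under this definition, an MAE of 2.03 years means that predictions deviate from the true age by about two years on average, regardless of the direction of the error.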