Crying is an important signal in early infancy that helps parents understand their baby's needs and thereby provide timely care or soothing, or be reassured. Thanks to recent advances in signal processing, deep learning, and internet-of-things technologies, smart baby monitors equipped with a microphone and/or a video camera have attracted considerable attention as aids to parents in the baby's room. In this paper, we propose a two-step approach to automatically detect infant cries in continuous audio signals. We first identify and remove segments without clear sounds (background noise) using a volume-based thresholding algorithm, and then apply convolutional neural network (CNN) models to detect infant cries among the remaining segments. The CNN operates on the log linear-scale filterbank energies of the audio signals, from which it extracts features for cry detection. This study included a large set of audio data (151.8 hours) collected from five infants in home settings. Our proposed approach achieved a mean accuracy of 98.6% in identifying background noise (with only 2 of 3209 cry segments missed) and a mean accuracy of 92.2% in distinguishing cries from other non-background sounds.
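
To make the two-step pipeline concrete, the following is a minimal sketch of its two signal-processing stages: a volume-based background filter and the computation of log linear-scale filterbank energies that would serve as CNN input. It is an illustration under stated assumptions, not the implementation used in this study. The segment length, the dB threshold, the frame and hop sizes, the FFT size, the number of filters, and all function names are hypothetical choices that the abstract does not specify, and the CNN itself is omitted.

```python
import numpy as np


def segment_rms_db(signal, seg_len):
    """RMS level in dB for each non-overlapping segment of seg_len samples."""
    n_seg = len(signal) // seg_len
    segs = signal[: n_seg * seg_len].reshape(n_seg, seg_len)
    rms = np.sqrt(np.mean(segs ** 2, axis=1))
    return 20.0 * np.log10(rms + 1e-12)  # small offset avoids log(0)


def is_background(signal, seg_len, threshold_db):
    """Step 1 (assumed form): flag segments whose volume is below a threshold."""
    return segment_rms_db(signal, seg_len) < threshold_db


def linear_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters with centres spaced linearly from 0 Hz to Nyquist."""
    edges = np.linspace(0.0, sample_rate / 2.0, n_filters + 2)
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank


def log_linear_fbank(signal, sample_rate, frame_len=400, hop=160,
                     n_fft=512, n_filters=40):
    """Step 2 input (assumed parameters): log linear-scale filterbank energies.

    Returns an array of shape (n_frames, n_filters).
    """
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hanning(frame_len)
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    energies = power @ linear_filterbank(n_filters, n_fft, sample_rate).T
    return np.log(energies + 1e-10)


# Synthetic demo: one second of faint noise followed by one second of a loud tone.
sr = 16000
rng = np.random.default_rng(0)
quiet = 0.001 * rng.standard_normal(sr)
loud = 0.5 * np.sin(2 * np.pi * 400.0 * np.arange(sr) / sr)
audio = np.concatenate([quiet, loud])

background = is_background(audio, seg_len=sr, threshold_db=-40.0)
features = log_linear_fbank(loud, sr)  # only non-background audio goes on to the CNN
print(background, features.shape)
```

In this sketch, only segments that pass the volume threshold are converted to filterbank features, mirroring the order of the two steps described above. A fixed threshold is the simplest choice; a threshold adapted to the recording's noise floor would be an equally plausible reading of "volume-based thresholding".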