This paper presents a novel method for accurate people counting in highly dense crowd images. The proposed method consists of three modules: foreground region extraction (EF), a pixel-wise attention mechanism (PAM), and a single-column density map estimator (S-DME). EF uses a fully convolutional network to suppress interference from complex backgrounds efficiently; PAM performs pixel-wise classification of crowd images to generate high-quality local crowd density maps; and S-DME is a carefully designed single-column network that learns more representative features with far fewer parameters. In addition, two new evaluation metrics are introduced to give a comprehensive understanding of how the individual modules of our algorithm perform. Experiments demonstrate that our approach achieves state-of-the-art results on several challenging datasets, including our own dataset with highly cluttered environments and various camera perspectives.
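As a rough illustration of how a foreground mask and an estimated density map can be combined into a people count, the sketch below sums the density values over foreground pixels only. It is not the authors' implementation: the function name `masked_count`, the grid representation, and the assumption that the mask is simply multiplied into the density map are all hypothetical, and the PAM stage is omitted because the abstract does not specify how its output is combined with the other two modules.

```python
from typing import List

# A 2-D grid of per-pixel values, stored as a list of rows (illustrative type).
Grid = List[List[float]]


def masked_count(density: Grid, foreground: Grid) -> float:
    """Estimate a count by summing a density map over foreground pixels.

    `density` holds per-pixel crowd density values (assumed to come from a
    density estimator such as S-DME); `foreground` holds mask weights in
    [0, 1] (assumed to come from a foreground extractor such as EF).
    Both names and the element-wise product are assumptions for illustration.
    """
    if len(density) != len(foreground):
        raise ValueError("density and foreground must have the same height")

    total = 0.0
    for d_row, m_row in zip(density, foreground):
        if len(d_row) != len(m_row):
            raise ValueError("density and foreground must have the same width")
        for d, m in zip(d_row, m_row):
            # Background pixels (m == 0) contribute nothing to the count.
            total += d * m
    return total


# Toy 2x3 example with made-up values.
density = [[0.5, 0.5, 0.25],
           [0.0, 1.0, 0.25]]
foreground = [[1.0, 1.0, 0.0],
              [0.0, 1.0, 0.0]]
count = masked_count(density, foreground)
```

In this toy example the two density values in the last column fall on background pixels, so they are excluded from the count. This is the effect that suppressing background clutter is meant to have on the final estimate.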