Scene classification, which consists of assigning semantic labels to images by exploiting the local spatial arrangements and structural patterns inside tiled regions, is a key problem in the automatic interpretation of optical high-spatial-resolution remote sensing imagery. Many state-of-the-art methods, e.g., the bag-of-visual-words model and its variants, topic models, and unsupervised feature learning-based approaches, share a similar procedure: patch sampling, feature learning, and classification. Patch sampling is the first of these steps and has a considerable influence on the results. Many different sampling strategies have been used in the literature, e.g., random sampling and saliency-based sampling. However, it remains unclear which sampling strategy is most suitable for the scene classification of optical high-spatial-resolution remote sensing images. In this paper, we comparatively study the effects of different sampling strategies in this setting. We divide the existing sampling methods into two types: random sampling and saliency-based sampling. Here, we consider the commonly used grid sampling to be a specific type of random sampling, while saliency-based sampling comprises keypoint-based sampling and salient region-based sampling. To compare their performance, we rely on a standard bag-of-visual-words model to learn the global features for testing because of its simplicity, robustness, and efficiency. In addition, we conduct experiments using a Fisher kernel framework to validate our conclusions. The experimental results obtained on two commonly used datasets with different feature learning methods show that random sampling can provide performance comparable to, or even better than, that of all of the saliency-based strategies.
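
To make the distinction between the two random-type strategies concrete, the following is a minimal Python sketch of grid sampling and uniform random sampling of patch locations. It is an illustration under stated assumptions, not the implementation used in the paper: the function names (`grid_corners`, `random_corners`, `crop`), the square patch shape, the stride parameter, and the uniform distribution over valid top-left corners are all hypothetical choices made for this example. Saliency-based sampling is not shown.

```python
import random


def grid_corners(height, width, patch_size, stride):
    """Top-left corners of patches laid out on a regular grid.

    Assumption for this sketch: patches that would cross the image
    border are dropped rather than padded.
    """
    rows = range(0, height - patch_size + 1, stride)
    cols = range(0, width - patch_size + 1, stride)
    return [(r, c) for r in rows for c in cols]


def random_corners(height, width, patch_size, n_patches, seed=None):
    """Top-left corners drawn uniformly at random over all valid positions."""
    rng = random.Random(seed)
    max_r = height - patch_size  # largest valid row index (inclusive)
    max_c = width - patch_size   # largest valid column index (inclusive)
    return [
        (rng.randint(0, max_r), rng.randint(0, max_c))
        for _ in range(n_patches)
    ]


def crop(image, corner, patch_size):
    """Extract a square patch from an image stored as a list of rows."""
    r, c = corner
    return [row[c:c + patch_size] for row in image[r:r + patch_size]]


if __name__ == "__main__":
    # Hypothetical 64 x 64 image tile with 16 x 16 patches.
    grid = grid_corners(64, 64, patch_size=16, stride=16)
    rand = random_corners(64, 64, patch_size=16, n_patches=len(grid), seed=0)
    print(len(grid), "grid patches;", len(rand), "random patches")
```

In a pipeline of the kind described above, the patches extracted at these locations would then be passed to the feature learning step (for example, a bag-of-visual-words encoder); only the way the locations are chosen differs between the strategies.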