Semi-supervised Video Object Segmentation (VOS) needs to establish pixel-level correspondences between a video frame and preceding segmented frames to leverage their segmentation clues. Most works rely on features at a single scale to establish those correspondences, e.g., perform dense matching with Convolutional Neural Network (CNN) features from a deep layer. Differently, this work explores complementary features at different scales to pursue more robust feature matching. A coarse feature from a deep layer is first adopted to get coarse pixel-level correspondences. We hence evaluate the quality of those correspondences, and select pixels with low-quality correspondences for fine-scale feature matching. Segmentation clues of previous frames are propagated by both coarse and fine-scale correspondences, which are fused with appearance features for object segmentation. Compared with previous works, this coarse-to-fine matching scheme is more robust to distractions by similar objects and better preserves object details. The sparse fine-scale matching also ensures a fast inference speed. On popular VOS datasets including DAVIS and YouTube-VOS, the proposed method shows promising performance compared with recent works.