Abstract This paper presents a novel scene text detection method based on a conditional random field (CRF) framework. We use a convolutional neural network (CNN) to estimate the confidence that each Maximally Stable Extremal Region (MSER) is text, and this confidence defines the unary cost term. To define the pairwise cost term, we model the interactions between neighboring MSERs using four types of features: color, shape, stroke, and spatial features. Because text in natural scene images follows a characteristic layout, we use context information to recover text MSER candidates that were missed. The text MSERs are then grouped into candidate text lines, which are verified by shape-specific classifiers that integrate grayscale and binary features. Experimental results on four public benchmark datasets show that the proposed method achieves performance comparable to existing methods.