Estimating pixel-wise surface normals from a single image is challenging but highly valuable for computer vision and robotics applications. By using spectrally and spatially varying illumination, multispectral photometric stereo can produce pixel-wise surface normals from just one image. However, multispectral photometric stereo methods face the entanglement of illumination, surface reflectance, and camera response, which leads to an under-determined system. Existing approaches rely on either extra depth information or material calibration strategies and assume a Lambertian surface, which limits their application in practical systems. Previous learning-based methods employ fully-connected or CNN architectures to estimate surface normals. Compared with a fully-connected framework, a CNN takes advantage of the information embedded in the neighborhood of a surface point but loses high-frequency surface normal details. In this paper, we present a new method that addresses this task with two stacked deep networks. We first apply a CNN-based structural cue network to approximate coarse surface normals on small patches. Then, we use a pixel-level fully-connected photometric cue network to refine surface normal details and correct errors from the first step. The fused network is robust to non-Lambertian surfaces and complex illumination environments, such as ambient light and varying light directions. Experimental results show that our dual-cue fused network outperforms existing methods.
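
To make the single-image claim concrete, the following minimal sketch shows the classical calibrated baseline, not the dual-cue network proposed here. It assumes a Lambertian surface, no shadows or ambient light, and a known 3x3 mixing matrix that folds together light directions, surface reflectance, and camera response for the three color channels. The matrix values and all function names are hypothetical and chosen only for illustration. When that matrix is unknown, the same three measurements no longer determine the normal, which is the under-determined case described above.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def solve3(m, b):
    """Solve m x = b for a 3x3 system using Cramer's rule."""
    d = det3(m)
    if abs(d) < 1e-12:
        raise ValueError("mixing matrix is singular")
    x = []
    for col in range(3):
        mc = [row[:] for row in m]      # copy, then swap in b as one column
        for r in range(3):
            mc[r][col] = b[r]
        x.append(det3(mc) / d)
    return x

def render_pixel(mixing, normal):
    """Lambertian forward model (assumed): rgb = mixing * normal."""
    return [sum(mixing[c][k] * normal[k] for k in range(3)) for c in range(3)]

def recover_normal(mixing, rgb):
    """Invert the forward model and rescale the result to unit length."""
    v = solve3(mixing, rgb)
    length = math.sqrt(sum(t * t for t in v))
    return [t / length for t in v]

# Hypothetical calibrated mixing matrix: row c is the effective light
# direction seen by color channel c (reflectance and response folded in).
MIXING = [[0.8, 0.1, 0.6],
          [-0.4, 0.7, 0.6],
          [-0.4, -0.7, 0.6]]

true_normal = [0.0, 0.6, 0.8]           # unit-length ground-truth normal
rgb = render_pixel(MIXING, true_normal)  # one RGB measurement for one pixel
estimate = recover_normal(MIXING, rgb)
```

Here one RGB triple is enough because each channel contributes one linear equation in the three normal components. Real scenes break these assumptions through spatially varying reflectance, specular highlights, and ambient light, which is the regime the learned networks are meant to handle.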