Depth maps captured by Kinect or time-of-flight (ToF) cameras play an active role in many visual applications. However, these depth maps are often contaminated with compound noise, comprising intrinsic sensor noise and missing pixels; in addition, depth maps captured by ToF cameras suffer from low resolution. As depth maps carry rich and critical information about 3D space, high-quality post-processing is crucial for supporting subsequent visual applications. Previous methods relied on the registered color image for guidance and on bicubic interpolation as an initialization for the up-sampling task, which gives rise to texture copying and blurred depth discontinuities. Motivated by these challenges, in this paper we propose a new optimization model that exploits the relative structures of the depth and color images for both the depth map filtering and up-sampling tasks. In our general model, two self-structure priors, one for the depth image and one for the color image, are constructed individually and used in both tasks. To overcome the texture-copying problem, the color-based prior is applied near depth edges and the depth-based prior in homogeneous regions. To this end, we further construct a confidence map for each task that controls where each prior is applied. Experimental results on both simulated and real datasets from Kinect and ToF cameras demonstrate that the proposed method outperforms benchmark methods.
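For concreteness, the confidence-weighted combination of the two priors described above suggests an energy of roughly the following form. This is a minimal sketch inferred from the abstract alone, not the paper's actual formulation; all symbols (E, D, D_0, G, M, \Phi_c, \Phi_d, \lambda) are introduced here for illustration.

% Hypothetical energy, sketched from the abstract only:
% D   = depth map to be estimated; D_0 = observed (or bicubic-initialized) depth;
% G   = registered color image; M(p) in [0,1] = confidence map, large near depth edges;
% \Phi_c, \Phi_d = color-based and depth-based self-structure priors; \lambda = trade-off weight.
E(D) = \| D - D_0 \|_2^2
     + \lambda \sum_{p} \Big[ M(p)\,\Phi_c(D, G; p) + \big(1 - M(p)\big)\,\Phi_d(D; p) \Big]

Under this reading, minimizing E(D) would denoise or up-sample the depth map while the confidence map M switches between the color-guided prior at discontinuities and the depth self-prior in flat regions, which is what suppresses texture copying.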