Monocular depth estimation has gained increasing momentum in recent years. Despite significant advances in this task, reliably capturing contextual cues from RGB images remains inherently difficult, and it is still challenging to accurately predict depth in scenes with complicated and cluttered spatial arrangements of objects. Instead of naively relying on the raw features of a single RGB image, in this paper we propose a hierarchical object-relationship-constrained network for monocular depth estimation, which enables accurate and smooth depth prediction from a single RGB image. The key idea of our method is to exploit object-centric hierarchical relationships as contextual constraints that capture the regularity of spatial depth variation. In particular, we design a semantics-guided CNN that simultaneously encodes the original image into a global context feature map and the objects' relationships into a local relationship feature map, so that this consolidated encoding scheme can guide depth prediction more accurately across scene samples. Benefiting from these local-to-global context constraints, our method respects the global depth variation while preserving local depth details. In addition, our approach makes full use of hierarchical semantic relationships across intra-object components and neighboring objects to define depth-variation constraints. We conduct extensive experiments and comprehensive evaluations on widely used public datasets, which confirm that our method outperforms most state-of-the-art depth estimation methods in preserving local depth details.
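To make the two-branch encoding idea concrete, below is a minimal PyTorch sketch of one plausible realization: one branch encodes a global context feature map from the RGB image, the other encodes a local relationship feature map (here driven by a semantic map standing in for object-relationship cues), and the two are fused before depth regression. The module names, channel sizes, and the use of a semantic map as the relationship signal are all illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class TwoBranchDepthNet(nn.Module):
    """Hypothetical two-branch encoder-decoder for depth prediction."""

    def __init__(self, num_classes: int = 40):
        super().__init__()
        # Global-context branch: a small downsampling CNN over the RGB image.
        self.global_branch = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Local-relationship branch: encodes a per-pixel semantic map as a
        # proxy for object-level relationship cues (an assumption of this sketch).
        self.local_branch = nn.Sequential(
            nn.Conv2d(num_classes, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # Fuse the two feature maps, then upsample back to a depth map.
        self.fuse = nn.Sequential(
            nn.Conv2d(128, 64, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1),
        )

    def forward(self, rgb: torch.Tensor, semantics: torch.Tensor) -> torch.Tensor:
        g = self.global_branch(rgb)        # global context feature map
        l = self.local_branch(semantics)   # local relationship feature map
        return self.fuse(torch.cat([g, l], dim=1))  # fused depth prediction

# Usage on dummy data: a 3-channel image plus a 40-class one-hot semantic map.
net = TwoBranchDepthNet(num_classes=40)
rgb = torch.randn(1, 3, 64, 64)
sem = torch.randn(1, 40, 64, 64)
depth = net(rgb, sem)  # -> shape (1, 1, 64, 64)
```

Concatenating the two feature maps before decoding is only one simple fusion choice; the point of the sketch is that the relationship features enter the prediction as an explicit second input rather than being inferred from the RGB features alone.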