Depth estimation is crucial in several computer vision applications, and a recent trend in this field aims at inferring depth from a single camera. Unfortunately, despite their compelling results, state-of-the-art monocular depth estimation methods are computationally demanding, which precludes their practical deployment in application contexts with low-power constraints. In this paper, we therefore propose a lightweight Convolutional Neural Network based on a shallow pyramidal architecture, referred to as $\mu$PyD-Net, that enables monocular depth estimation on microcontrollers. The network is trained in a self-supervised manner, leveraging proxy labels obtained through a traditional stereo algorithm. Moreover, we propose optimization strategies that perform computations with quantized 8-bit data and map the high-level description of the network to low-level layers optimized for the target microcontroller architecture. Extensive experimental results on standard datasets and an in-depth evaluation on a device belonging to the popular Arm Cortex-M family confirm that sufficiently accurate monocular depth estimation on microcontrollers is feasible. To the best of our knowledge, our proposal is the first to achieve this, paving the way for the deployment of monocular depth cues on the tiny end-nodes of distributed sensor networks.
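
As an illustration of what computing with quantized 8-bit data can involve, the following is a minimal sketch of symmetric per-tensor quantization in plain Python. The abstract does not specify the quantization scheme actually used, so the symmetric scheme, the function names `quantize_int8` and `dequantize_int8`, and the sample weights are all assumptions made for this example.

```python
from typing import List, Tuple


def quantize_int8(values: List[float]) -> Tuple[List[int], float]:
    """Symmetric per-tensor quantization of floats to signed 8-bit integers.

    Assumed scheme for illustration only; not taken from the paper.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    # Map the largest magnitude to 127; fall back to 1.0 for an all-zero tensor.
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the int8 range.
    quantized = [max(-128, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from 8-bit integers."""
    return [q * scale for q in quantized]


# Hypothetical weight values, chosen only to exercise the functions.
weights = [-0.50, -0.12, 0.0, 0.25, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize_int8(q, scale)

# The round-trip error of each value is bounded by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, scale, max_error)
```

Storing each value in one byte instead of a 32-bit float reduces memory by a factor of four, at the cost of a rounding error of at most half a quantization step per value.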