Self-organizing networks (SONs) can help manage the severe interference in dense heterogeneous networks (HetNets). Given the need to automatically configure power and other settings, machine learning is a promising tool for data-driven decision making in SONs. In this paper, a HetNet is modeled as a dense two-tier network in which conventional macrocells are overlaid with denser small cells (e.g., femtocells or picocells). First, a distributed framework based on the multi-agent Markov decision process is proposed to model the power optimization problem in the network. Second, we present a systematic approach for designing a reward function based on the optimization problem. Third, we introduce a Q-learning-based distributed power allocation algorithm (Q-DPA) as a self-organizing mechanism that enables ongoing transmit power adaptation as new small cells are added to the network. Furthermore, we provide the sample complexity required for Q-DPA to achieve $\epsilon$-optimality with high probability. We demonstrate that, at densities of several thousand femtocells per km$^2$, the required quality of service of a macrocell user can be maintained through the proper selection of independent or cooperative learning and an appropriate Markov state model.
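To make the Q-DPA idea concrete, the following is a minimal sketch of tabular Q-learning applied to distributed transmit-power selection. All specifics here are illustrative assumptions, not the paper's design: the power levels, state discretization, hyperparameters, and the toy reward (own-rate gain minus an interference penalty toward the macrocell user) are hypothetical stand-ins for the Markov state model and reward function developed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discretization: a few transmit-power levels (dBm) and a small
# set of states (e.g., a quantized macrocell-user SINR indicator). These are
# illustrative choices, not the paper's exact state/action design.
POWER_LEVELS = np.array([-10.0, 0.0, 10.0, 20.0])   # actions (dBm)
N_STATES = 4                                         # quantized interference states
ALPHA, GAMMA, EPS = 0.1, 0.9, 0.1                    # learning rate, discount, exploration

Q = np.zeros((N_STATES, len(POWER_LEVELS)))          # tabular Q-function


def toy_env_step(state, action):
    """Toy stand-in for the network: the reward trades off the small cell's
    own rate against the interference it causes to the macrocell user."""
    p_dbm = POWER_LEVELS[action]
    own_rate = np.log2(1.0 + 10 ** (p_dbm / 10.0))   # crude rate proxy
    interference_penalty = 0.05 * 10 ** (p_dbm / 10.0)
    reward = own_rate - interference_penalty
    next_state = rng.integers(N_STATES)              # random transition (toy model)
    return next_state, reward


state = rng.integers(N_STATES)
for _ in range(5000):
    # epsilon-greedy action selection over power levels
    if rng.random() < EPS:
        action = int(rng.integers(len(POWER_LEVELS)))
    else:
        action = int(np.argmax(Q[state]))
    next_state, reward = toy_env_step(state, action)
    # standard Q-learning update
    Q[state, action] += ALPHA * (reward + GAMMA * Q[next_state].max() - Q[state, action])
    state = next_state

print("Learned power level per state (dBm):", POWER_LEVELS[Q.argmax(axis=1)])
```

In the independent-learning variant each small cell would run such an agent on its locally observed state, while a cooperative variant would share Q-values or state information among agents; the sketch above shows only a single independent agent.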