Hand pose estimation has recently emerged as a compelling topic in the robotics research community, owing to its applications in learning from human demonstration and in safe human–robot interaction. Although deep learning-based methods have been introduced for this task and have shown promise, it remains a challenging problem. To address it, we propose a novel end-to-end architecture for hand pose estimation from color and depth (RGB-D) data. Our approach processes the two modalities separately and uses a dense fusion network with an attention module to extract discriminative features. The extracted features capture both spatial appearance information and geometric constraints, and are fused to vote for the hand pose. We show that this voting mechanism, combined with the attention mechanism, is particularly effective when hands are heavily occluded by objects or self-occluded. Experimental results on benchmark datasets demonstrate that our approach outperforms state-of-the-art methods by a significant margin.
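To make the dense fusion, attention, and voting components more concrete, the following is a minimal sketch in PyTorch of how per-point RGB and geometric features could be fused, attention-weighted, and aggregated into keypoint votes. The module name, layer sizes, and the choice of 21 hand keypoints are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class AttentiveDenseFusion(nn.Module):
    """Illustrative sketch: dense per-point fusion with attention-weighted voting."""

    def __init__(self, rgb_dim=128, geo_dim=128, num_keypoints=21):
        super().__init__()
        fused_dim = rgb_dim + geo_dim
        # Per-point attention scores highlight points that stay informative
        # under occlusion (e.g. fingertips still visible around a grasped object).
        self.attention = nn.Sequential(
            nn.Conv1d(fused_dim, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 1, 1),
        )
        # Per-point voting head: each point predicts a 3D offset to every keypoint.
        self.vote = nn.Sequential(
            nn.Conv1d(fused_dim, 256, 1), nn.ReLU(),
            nn.Conv1d(256, num_keypoints * 3, 1),
        )
        self.num_keypoints = num_keypoints

    def forward(self, rgb_feat, geo_feat, points):
        # rgb_feat: (B, rgb_dim, N) appearance features sampled at each point
        # geo_feat: (B, geo_dim, N) geometric features from the point cloud
        # points:   (B, 3, N)       3D point coordinates lifted from the depth map
        fused = torch.cat([rgb_feat, geo_feat], dim=1)           # dense fusion
        weights = torch.softmax(self.attention(fused), dim=-1)   # (B, 1, N)
        offsets = self.vote(fused).view(
            fused.size(0), self.num_keypoints, 3, -1)            # (B, K, 3, N)
        # Each point votes for every keypoint; attention-weighted averaging
        # aggregates the votes into a single 3D hand pose estimate.
        votes = points.unsqueeze(1) + offsets                    # (B, K, 3, N)
        keypoints = (votes * weights.unsqueeze(1)).sum(dim=-1)   # (B, K, 3)
        return keypoints
```

In this sketch, the attention weights down-weight occluded or unreliable points before the votes are averaged, which is the intuition behind the robustness to object occlusion and self-occlusion described above.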