Vision-based hand gesture recognition (HGR) provides one of the most effective and natural means of interaction between humans and machines. However, the recognition performance of such an HGR system is challenged by variations in illumination, complex backgrounds, variations in the shape of the user’s hand, and inter-class similarity. This work proposes a compact dual-stream dense residual fusion network (DeReFNet) to address these challenges. The proposed convolutional neural network architecture exploits the global features extracted by each residual block of the residual stream and the spatial information captured by the other stream through dense connectivity. The two streams are fused by a feature concatenation module to produce enriched representations. The efficacy of DeReFNet is validated using subject-independent cross-validation on four publicly available benchmark datasets. Furthermore, qualitative and quantitative analyses on these benchmark datasets show that DeReFNet outperforms state-of-the-art methods in terms of both accuracy and computational time.
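The dual-stream design described above can be illustrated with a minimal sketch. This is not the paper's implementation (the abstract gives no layer details); it is a toy model, with made-up scalar weights and 1-D feature vectors standing in for convolutional feature maps, showing how a residual stream (skip connections), a densely connected stream (each block sees the concatenation of all earlier outputs), and a final feature concatenation module fit together:

```python
# Illustrative sketch only: real DeReFNet layers, widths, and weights
# are not specified in this abstract; all names and values below are
# hypothetical. Feature maps are plain 1-D lists of floats.

def residual_block(x, weight):
    # y = f(x) + x: the skip connection preserves the block's input,
    # letting the residual stream carry global features forward.
    return [weight * v + v for v in x]

def dense_block(features, weight):
    # Dense connectivity: the block's input is the concatenation of
    # ALL preceding feature maps in the stream.
    x = [v for f in features for v in f]
    return [weight * v for v in x]

def derefnet_sketch(x):
    # Residual stream: two stacked residual blocks.
    r = residual_block(x, 0.5)
    r = residual_block(r, 0.5)
    # Dense stream: each block receives every earlier output.
    feats = [x]
    d1 = dense_block(feats, 0.1)
    feats.append(d1)
    d2 = dense_block(feats, 0.1)
    # Feature concatenation module: fuse both streams into one
    # enriched feature vector.
    return r + d2

fused = derefnet_sketch([1.0, 2.0])
```

A classifier head would then operate on the fused vector; the point of the sketch is only that the residual stream retains its input through additive skips while the dense stream reuses every earlier feature map, and the fusion step concatenates the two.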