Human sign language is a visual, gestural means of communication used by people with hearing impairments. Automatic sign language recognition can bridge the gap between signers and people who lack the proficiency to understand sign language. Wh-questions constitute a substantial portion of daily sign language interactions, yet automatic recognition of Wh-question (interrogative) signs has received limited attention in gesture recognition research. To address this gap, we introduce a novel dataset, the American Question Sign Video dataset (AQSVd). Recognizing Wh-question signs in video streams lets individuals convey messages effectively through hand movements and gestures, fostering inclusive and accessible communication for deaf and hearing-impaired people. This paper proposes a Deep Convolutional-3D BiLSTM Multi-head Attention (C3D-BiLSTM MHAttention) network for recognizing American Wh-question sign gestures in video streams. Incorporating multi-head attention strengthens the model's ability to capture the intricate spatial and temporal features essential for accurate gesture recognition. The model is evaluated on the INCLUDE50, WLASL-100, and AQSVd datasets, achieving 98.91% validation accuracy on AQSVd and outperforming other state-of-the-art models in sign language recognition tasks. The proposed dataset and model contribute to this field, with potential benefits in education, human-robot interaction, and everyday communication for the hearing-impaired community.
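The abstract names a C3D-BiLSTM multi-head-attention pipeline without detailing it. The sketch below illustrates one plausible arrangement of those three stages in PyTorch; all layer sizes, the class name, and the pooling choices are assumptions for illustration, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class C3DBiLSTMMHAttention(nn.Module):
    """Hypothetical sketch: 3D convolutions -> BiLSTM -> multi-head
    attention -> classifier, for short sign-video clips. Layer sizes
    are illustrative only, not taken from the paper."""

    def __init__(self, num_classes=8, hidden=64, heads=4):
        super().__init__()
        # 3D convolutions extract joint spatio-temporal features
        self.c3d = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool3d((1, 2, 2)),                 # pool space, keep time
            nn.Conv3d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d((None, 4, 4)),      # fixed spatial grid
        )
        # BiLSTM models frame-to-frame temporal dynamics
        self.bilstm = nn.LSTM(32 * 4 * 4, hidden, batch_first=True,
                              bidirectional=True)
        # Multi-head self-attention weighs the most informative frames
        self.attn = nn.MultiheadAttention(2 * hidden, heads, batch_first=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, x):                 # x: (batch, 3, frames, H, W)
        f = self.c3d(x)                   # (batch, 32, frames, 4, 4)
        f = f.permute(0, 2, 1, 3, 4).flatten(2)   # (batch, frames, 512)
        seq, _ = self.bilstm(f)           # (batch, frames, 2*hidden)
        ctx, _ = self.attn(seq, seq, seq) # self-attention over frames
        return self.fc(ctx.mean(dim=1))   # (batch, num_classes)

model = C3DBiLSTMMHAttention()
logits = model(torch.randn(2, 3, 8, 32, 32))  # 2 clips, 8 RGB frames each
print(logits.shape)
```

Averaging the attended sequence before the final linear layer is one common way to reduce per-frame features to a clip-level prediction; the paper may instead use the last hidden state or another pooling scheme.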