As artificial intelligence finds its way into daily life, smart home control systems are attracting increasing attention. In practice, however, these systems face challenges such as complex scenes and multimodal information. This paper therefore introduces Visual Question Answering (VQA) technology to improve the intelligence and user experience of smart home control systems. Although VQA offers clear potential in this setting, current models still fall short in handling local image information and in integrating visual-language multimodal features. To address these shortcomings, this paper proposes a Transformer-based Multimodal Fusion Network (TMFNet). TMFNet introduces a global-local feature attention mechanism, deep encoding-decoding modules, and a multimodal representation module to overcome the limitations of existing models in complex smart home scenarios.
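To make the fusion idea concrete, the sketch below shows a minimal single-head scaled dot-product cross-attention step, in which a question embedding attends over a set of image-region features and their weighted sum serves as a fused multimodal vector. This is only an illustration of the general attention-based fusion principle; the feature dimensions, region vectors, and function names are hypothetical and do not reproduce the actual TMFNet architecture.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product attention: the question vector (query)
    attends over image-region vectors (keys/values) and returns the fused
    feature plus the attention weights."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    fused = [sum(w * v[i] for w, v in zip(weights, values))
             for i in range(len(values[0]))]
    return fused, weights

# Hypothetical toy features: three 4-dimensional image regions and a
# question embedding aligned with the first region.
regions = [[1.0, 0.0, 0.0, 0.0],
           [0.0, 1.0, 0.0, 0.0],
           [0.5, 0.5, 0.0, 0.0]]
question = [1.0, 0.0, 0.0, 0.0]
fused, weights = cross_attention(question, regions, regions)
```

In this toy example, the region most similar to the question receives the largest attention weight, which is the behavior a global-local attention mechanism exploits when localizing question-relevant image content.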