Ultra-low power consumption and flexible hardware configurability are urgently required for resource-constrained artificial intelligence of things (AIoT) devices. We therefore propose a convolutional engine based on tensor multiplication. It consists of a reconfigurable processing element (RPE) array that dynamically adapts to convolutional operations with varying kernel sizes at runtime. Implemented in a 22-nm CMOS process, the proposed RPE cluster achieves high energy efficiency, flexibility, and resource utilization with low hardware overhead compared to state-of-the-art (SOTA) PE architectures. Furthermore, the multiply-accumulate (MAC) units in the RPE support an accurate mode and multiple approximate computing modes. The approximate modes provide a configurable approximation degree, coupled with a search strategy for the approximation factor, to improve energy efficiency. In a neural network-based keyword spotting (KWS) task, the RPE cluster with the proposed approximation solution reduces whole-system power consumption by 34.64% with only a 0.7% accuracy loss compared to using the accurate computing mode throughout, and it achieves the lowest inference energy among SOTA designs.
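To make the idea of a configurable approximation degree concrete, the following is a minimal Python sketch of one common realization, operand truncation of the k least-significant bits before the multiply, together with an accuracy-driven search for the largest usable factor. The function names (`approx_mac`, `search_factor`), the truncation scheme, and the error budget are illustrative assumptions for exposition; the paper's actual MAC circuit and search strategy may differ.

```python
def approx_mac(acc, a, b, k=0):
    """Approximate multiply-accumulate: zero out the k least-significant
    bits of each operand before multiplying. k = 0 is the accurate mode;
    larger k trades accuracy for energy. (Illustrative scheme, not the
    paper's exact circuit.)"""
    a_t = (a >> k) << k
    b_t = (b >> k) << k
    return acc + a_t * b_t


def search_factor(pairs, max_err, k_max=4):
    """Illustrative search for the approximation factor: return the
    largest k whose mean relative multiply error over sample operand
    pairs stays within the budget max_err."""
    best = 0
    for k in range(k_max + 1):
        errs = []
        for a, b in pairs:
            exact = a * b
            approx = approx_mac(0, a, b, k)
            errs.append(abs(exact - approx) / exact if exact else 0.0)
        if sum(errs) / len(errs) <= max_err:
            best = k
    return best
```

In this sketch, the accuracy-vs-energy knob of the abstract maps to a single integer k, and the search strategy simply picks the most aggressive k that keeps the observed error inside a user-set budget.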