This study presents a method for optimizing convolutional neural networks (CNNs) for the low-power platforms common in IoT and embedded devices, which are integral to smart city infrastructure. We introduce a “Rotating Convolutional Kernel” technique designed to significantly reduce the computational load of CNNs while maintaining high accuracy, which is essential given the constrained processing capabilities of devices produced by Hisilicon, MediaTek, and Novatek, among others. By leveraging the intelligent video engine (IVE) capabilities of low-end CPUs, our optimized YOLOv3-Tiny model, which has fewer than 100 K parameters and is fine-tuned for pedestrian and vehicle detection, processes frames in 52 ms on the HI3516EV200 chip and 66 ms on the MSC313E chip with minimal loss of accuracy. Despite a substantial reduction in parameter count compared to MobileNetV2, our model’s top-1 accuracy decreases by only 2.6%, demonstrating the effectiveness of the optimization. These results indicate that the method can improve the performance and utility of IoT and embedded devices in smart cities. By balancing computational efficiency against detection accuracy, our approach offers a promising way to extend the capabilities of low-power devices in urban surveillance, traffic management, and other smart city applications, thereby contributing to more intelligent, efficient, and responsive urban environments.