Abstract

Demand for simpler and more accurate architectures has grown with the widespread use of computer vision in everyday applications such as self‐driving cars, portable applications, augmented reality systems, and medical image analysis. Many methods have been developed to improve the accuracy and reduce the complexity of object detection, including successive generations of R‐CNN and YOLO. However, these architectures are not yet as efficient as they could be, and there is room for improvement. In this study, the fifth version of YOLO (YOLOv5) is taken as the baseline, and an improved architecture, Inception‐YOLO, is presented. The model significantly outperforms the state‐of‐the‐art (SOTA) YOLO family. Specifically, the improvements are a substantial reduction in floating point operations (FLOPs) and number of parameters, together with higher accuracy than models that require fewer FLOPs. The presented approaches, namely the optimized inception module, the proposed structures for CSP and SPPF, and the improved loss function, work together to incrementally improve detection accuracy, memory requirements, and FLOPs simultaneously. As an indication of performance, the Inception‐YOLO‐S model achieves 38.7% AP with 5.9M parameters and 11.5 BFLOPs, outperforming YOLOv5‐S, which reaches 37.4% AP with 7.2M parameters and 16.5 BFLOPs.
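To put the headline comparison in relative terms, the short sketch below derives the percentage reductions implied by the figures quoted above. It only does arithmetic on the reported numbers; the helper `relative_reduction` and the dictionary names are illustrative and are not part of the Inception‐YOLO code.

```python
# Relative savings implied by the figures reported in the abstract.
# All names here are illustrative; only the six numbers come from the text.

def relative_reduction(baseline: float, improved: float) -> float:
    """Return the percentage by which `improved` is smaller than `baseline`."""
    return 100.0 * (baseline - improved) / baseline

# Reported values: AP in percent, parameters in millions, compute in BFLOPs.
yolov5_s = {"ap": 37.4, "params_m": 7.2, "bflops": 16.5}
inception_yolo_s = {"ap": 38.7, "params_m": 5.9, "bflops": 11.5}

param_cut = relative_reduction(yolov5_s["params_m"], inception_yolo_s["params_m"])
flop_cut = relative_reduction(yolov5_s["bflops"], inception_yolo_s["bflops"])
ap_gain = inception_yolo_s["ap"] - yolov5_s["ap"]  # absolute AP points

print(f"Parameter reduction: {param_cut:.1f}%")
print(f"FLOPs reduction:     {flop_cut:.1f}%")
print(f"AP gain:             {ap_gain:.1f} points")
```

On these figures, the S model uses roughly 18% fewer parameters and roughly 30% fewer FLOPs than YOLOv5‐S while gaining 1.3 AP points.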