In the promising Artificial Intelligence of Things technology, deep learning algorithms are implemented on edge devices to process data locally. However, high-performance deep learning algorithms are accompanied by increased computation and parameter storage costs, leading to difficulties in implementing huge deep learning algorithms on memory and power constrained edge devices, such as smartphones and drones. Thus various compression methods are proposed, such as channel pruning. According to the analysis of low-level operations on edge devices, existing channel pruning methods have limited effect on latency optimization. Due to data processing operations, the pruned residual blocks still result in significant latency, which hinders real-time processing of CNNs on edge devices. Hence, we propose a generic deep learning architecture optimization method to achieve further acceleration on edge devices. The network is optimized in two stages, Global Constraint and Start-up Latency Reduction, and pruning of both channels and residual blocks is achieved. Optimized networks are evaluated on desktop CPU, FPGA, ARM CPU, and PULP platforms. The experimental results show that the latency is reduced by up to 70.40%, which is 13.63% higher than only applying channel pruning and achieving real-time processing in the edge device.