The separation of data capture and analysis in modern vision systems has led to massive data transfer between end devices and cloud computers, resulting in long latency, slow response, and high power consumption. Efficient hardware architectures are being actively developed to enable Artificial Intelligence (AI) on resource-limited sensing devices. One of the most promising solutions is the Processing-in-Pixel (PIP) scheme; however, conventional PIP schemes suffer from a low fill factor. This paper proposes a PIP-based Complementary Metal-Oxide-Semiconductor (CMOS) sensor architecture that performs the convolution operation before the column readout circuit, significantly reducing the overall power consumption while improving the resource utilization of the subsequent deep learning accelerator. Simulation results show that the proposed architecture achieves a computing efficiency of up to 3.37 TOPS/W with 8-bit weights, four times that of conventional schemes after normalization. Each pixel requires only 3.5 transistors (3.5T), significantly improving the fill factor.
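To make the readout-saving intuition concrete, the following is a minimal conceptual sketch, not the paper's circuit model: it treats the in-pixel multiply-accumulate as an ideal digital operation and simply compares how many values must be digitized and read out when convolution happens before versus after the column readout. The array size, kernel size, and random 8-bit weights are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

H, W, K = 64, 64, 3      # assumed pixel-array and kernel sizes (illustrative only)
pixels  = rng.integers(0, 256, size=(H, W)).astype(np.int64)      # stand-in photodiode values
weights = rng.integers(-128, 128, size=(K, K)).astype(np.int64)   # signed 8-bit weights

def convolve(img, w):
    """Valid 2-D convolution. In a PIP view, this multiply-accumulate would be
    carried out in or near the pixel array before the column readout, rather
    than digitally after every pixel has been read out."""
    k = w.shape[0]
    out_h, out_w = img.shape[0] - k + 1, img.shape[1] - k + 1
    out = np.empty((out_h, out_w), dtype=np.int64)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(img[i:i+k, j:j+k] * w)
    return out

feature_map = convolve(pixels, weights)

# First-order readout cost comparison: a conventional sensor digitizes every
# pixel and convolves downstream, whereas an in-pixel scheme only needs to
# read out the convolution results.
conventional_readouts = H * W
in_pixel_readouts = feature_map.size
print(f"values read out: conventional={conventional_readouts}, in-pixel={in_pixel_readouts}")
```

Under these assumed sizes the in-pixel path reads out roughly 6% fewer values; the actual power and efficiency gains reported above depend on the analog implementation and are not captured by this sketch.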