This paper presents an IC implementation of a neural network accelerator with on-chip learning, built from highly linear, CMOS-compatible floating-gate charge-trap devices. A simple learning algorithm that combines winner-take-all selection with competitive learning is proposed to enable fast and power-efficient hardware. The algorithm was analyzed in MATLAB using a behavioral model of the emerging non-volatile memory. The linearity, symmetry, and cycle-to-cycle variation of the multi-bit switching characteristic affect training accuracy. The proposed content-aware programming technique, based on a modulated column-line driver, provides flexibility for real-time training while maintaining device linearity, even though each unit cell requires a different update step at every training iteration. The prototype IC adopts a processing-in-memory structure for energy-efficient computing, in which the cell array is divided into four sub-blocks to reduce IR drop. Fabricated in 180 nm CMOS technology, the prototype IC consumes 353.3 pJ in inference mode and 898.2 pJ in training mode, corresponding to 95.05 TOPS/W and 38.03 TOPS/W, respectively. A fully integrated non-volatile AI IC with on-chip learning is demonstrated with a throughput of 1343.2 GOPS.
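The abstract does not spell out the learning rule, so the following is only a minimal sketch of generic winner-take-all competitive learning, not the authors' algorithm. It assumes that the winner is the unit with the largest dot product between its weight row and the input (as a crossbar multiply-accumulate would produce), and that only the winner's weights move toward the input. The function names, the learning rate `lr`, and the update form are hypothetical, and device effects such as nonlinearity, asymmetry, and cycle-to-cycle variation are not modeled.

```python
def winner_take_all(weights, x):
    """Return the index of the weight row with the largest dot product with x."""
    scores = [sum(w * xi for w, xi in zip(row, x)) for row in weights]
    return max(range(len(scores)), key=scores.__getitem__)


def competitive_update(weights, x, lr=0.1):
    """Move only the winning row toward x (assumed rule); return the winner index."""
    k = winner_take_all(weights, x)
    # Each weight in the winning row takes a step proportional to its distance
    # from the input, so the step size differs from cell to cell.
    weights[k] = [w + lr * (xi - w) for w, xi in zip(weights[k], x)]
    return k


# Usage: two units, one input pattern that is closer to the second unit.
weights = [[1.0, 0.0], [0.0, 1.0]]
winner = competitive_update(weights, [0.2, 1.0], lr=0.5)
```

In this sketch the per-cell step `lr * (xi - w)` depends on both the input and the stored weight, which illustrates why an update scheme may need to apply a different step to every unit cell at every training iteration.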