Deep learning has produced dramatic breakthroughs in many areas, motivating recent studies that combine neural networks with cache replacement algorithms. However, deep learning is a poor fit for hardware cache replacement because its neural network models are impractically large and slow. Many studies have used the Belady algorithm as guidance to speed up cache replacement prediction, but accurately predicting the characteristics of future access addresses remains impractical, which makes the discrimination of complex access patterns inaccurate. This paper therefore presents the LSTM-CRP algorithm and an efficient hardware implementation of it, which employs long short-term memory (LSTM) networks to identify access patterns at run time and guide the cache replacement algorithm. LSTM-CRP first converts each address into a novel key based on the access frequency of the address and a virtual cache capacity; the key has low information redundancy and high timeliness. Using the key as the input to four offline-trained LSTM-based predictors, LSTM-CRP accurately classifies different access patterns and identifies current cache characteristics in a timely manner through an online set dueling mechanism on sampling caches. For efficient implementation, LSTM-CRP uses purpose-built heterogeneous lightweight LSTM networks to lower hardware overhead and inference delay. Experimental results show that LSTM-CRP improves the cache hit rate by an average of 20.10%, 15.35%, 12.11%, and 8.49% compared with LRU, RRIP, Hawkeye, and Glider, respectively. Implemented on a Xilinx XCVU9P FPGA at a cost of 15,973 LUTs and 1,610 FF registers, LSTM-CRP runs at 200 MHz with a power consumption of 2.74 W.
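
To make the set dueling idea concrete, the following is a minimal software sketch of how one of several candidate policies might be selected from miss counts observed on sampled sets. It illustrates the general set dueling technique only and is not the LSTM-CRP hardware design: the class name `SetDuel`, the leader-set layout (`leader_stride`), and the counter width (`counter_max`) are assumptions introduced here, and only the number of candidates (four) is taken from the text above.

```python
from dataclasses import dataclass, field


@dataclass
class SetDuel:
    """Choose among candidate policies using misses seen on sampled sets.

    Hypothetical helper: each policy owns a few "leader" sets that always
    use it; every other ("follower") set uses whichever policy currently
    has the fewest recorded leader misses.
    """
    num_policies: int = 4      # four predictors, as stated in the abstract
    leader_stride: int = 32    # assumed: one group of leader sets per 32 sets
    counter_max: int = 1023    # assumed: 10-bit saturating miss counters
    misses: list = field(default_factory=list)

    def __post_init__(self):
        self.misses = [0] * self.num_policies

    def leader_of(self, set_index):
        """Return the policy id owning this set, or None for a follower set."""
        slot = set_index % self.leader_stride
        return slot if slot < self.num_policies else None

    def record_miss(self, set_index):
        """Count a miss; only misses on leader sets influence the duel."""
        owner = self.leader_of(set_index)
        if owner is None:
            return
        if self.misses[owner] >= self.counter_max:
            # Halve every counter so the ranking survives and old history decays.
            self.misses = [m // 2 for m in self.misses]
        self.misses[owner] += 1

    def policy_for(self, set_index):
        """Leader sets keep their own policy; followers take the current winner."""
        owner = self.leader_of(set_index)
        if owner is not None:
            return owner
        return min(range(self.num_policies), key=self.misses.__getitem__)
```

In this sketch, a simulator would call `record_miss(set_index)` on every cache miss and `policy_for(set_index)` whenever a victim must be chosen. Halving all counters on saturation is one common way to keep the comparison responsive to phase changes; other decay schemes would serve equally well here.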