Financial markets are complex and dynamic, demanding increasingly sophisticated investment strategies, and deep reinforcement learning (DRL) has proven successful in generating real-time investment strategies that outperform classical models. Alongside this, statistical arbitrage strategies exploit temporary market inefficiencies to generate returns. In this context, a novel DRL-based arbitrage method has been developed. This paper presents a unified framework that combines classical statistical arbitrage theory with DRL to generate investment strategies. The framework addresses the challenges of identifying portfolios of similar assets, extracting signals that indicate temporary price deviations, and determining optimal trading rules given market conditions. The proposed methodology involves constructing arbitrage portfolios based on cointegration relationships, extracting signals from the price series of these portfolios, and using a DRL agent to make optimal decisions within a fixed time horizon.

The empirical analysis focuses on the cryptocurrency market, known for its volatility and risk. The results demonstrate that DRL agents can generate strategies with positive out-of-sample returns of 79.52% to 112.82% when transaction costs are ignored, outperforming the market benchmark Bitcoin (32.51% return), the best-performing asset over the period, Litecoin (57.11% return), and the worst-performing, Solana (a 35.70% loss). Moreover, these strategies effectively reduce risk, achieving higher risk-adjusted returns than the individual assets. The strategies maintain positive returns when transaction costs are considered, with the DRL agents outperforming the standard arbitrage strategy. The best-performing strategy is based on a Deep Q-Network (DQN) agent, with a return of 18.39%, an annualized volatility of 12.22%, and an annualized Sharpe ratio of 2.43; by comparison, Bitcoin has an annualized volatility of 44.13% and a Sharpe ratio of 1.08.

Finally, the randomness and coherence of the agents' decisions are studied to verify their robustness. The actions generated by the agents are not random but follow well-founded policies, which is consistent with the results obtained.
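To make the portfolio-construction and signal-extraction steps concrete, the following is a minimal sketch, not the paper's exact pipeline: it tests a pair of assets for cointegration, forms the hedge-ratio spread that plays the role of the arbitrage portfolio, and extracts a rolling z-score as the deviation signal. The function name, window length, and z-score form are illustrative assumptions.

```python
# Sketch of cointegration-based spread construction and deviation-signal extraction.
# Assumed helper, not taken from the paper: spread_signal().
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.tsa.stattools import coint

def spread_signal(p1: pd.Series, p2: pd.Series, window: int = 60):
    """Return the hedge-ratio spread with a rolling z-score signal, plus the cointegration p-value."""
    y, x = np.log(p1), np.log(p2)
    _, p_value, _ = coint(y, x)                                  # Engle-Granger cointegration test
    beta = sm.OLS(y, sm.add_constant(x)).fit().params.iloc[1]    # hedge ratio from OLS regression
    spread = y - beta * x                                        # value of the arbitrage portfolio
    z = (spread - spread.rolling(window).mean()) / spread.rolling(window).std()
    return pd.DataFrame({"spread": spread, "zscore": z}), p_value

# Usage (hypothetical series): signals, pval = spread_signal(btc_close, ltc_close)
# A large |zscore| flags a temporary deviation the DRL agent can act on.
```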
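For the risk figures cited above, the sketch below shows one standard way to compute annualized volatility and an annualized Sharpe ratio from daily returns; the 365-day crypto trading year and zero risk-free rate are assumptions, not specifications from the paper.

```python
# Sketch of the annualized risk-adjusted performance metrics used for comparison.
import numpy as np
import pandas as pd

def annualized_metrics(daily_returns: pd.Series, periods_per_year: int = 365):
    """Return (annualized return, annualized volatility, annualized Sharpe ratio)."""
    mean, std = daily_returns.mean(), daily_returns.std()
    ann_return = (1.0 + mean) ** periods_per_year - 1.0
    ann_vol = std * np.sqrt(periods_per_year)
    sharpe = np.sqrt(periods_per_year) * mean / std   # zero risk-free rate assumed
    return ann_return, ann_vol, sharpe
```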