Opportunistic Delay Tolerant Networks, also referred to as Opportunistic Networks (OppNets), are a subset of wireless networks in which mobile nodes have only intermittent, opportunistic connections. Developing a performant routing protocol in such an environment therefore remains a challenge. Most research in the literature has shown that reinforcement learning-based routing algorithms can achieve good routing performance, but these algorithms suffer from under-estimation and/or over-estimation of action values. To address these shortcomings, this paper proposes a Double Q-learning based routing framework for Opportunistic Networks, named Off-Policy Reinforcement-based Adaptive Learning (ORAL), which uses a weighted double Q-estimator to select, without bias, the most suitable next-hop node to forward a message toward its destination. The next-hop selection process incorporates a probability-based reward mechanism that considers a node's delivery probability and the frequency of encounters among nodes to boost the protocol's efficiency. Simulation results show that the proposed ORAL protocol improves the message delivery ratio by maintaining a trade-off between underestimation and overestimation. Simulations conducted on the HAGGLE INFOCOM 2006 real mobility trace and a synthetic mobility model show that, as the message time-to-live is varied, (1) the proposed ORAL scheme outperforms DQLR by 14.05%, 9.4%, and 5.81% in terms of delivery probability, overhead ratio, and average delay, respectively; and (2) it outperforms RLPRoPHET by 16.17%, 9.2%, and 6.85% in terms of delivery ratio, overhead ratio, and average delay, respectively.
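To make the weighted double Q-estimator idea concrete, the following is a minimal sketch, not the authors' implementation: it assumes a tabular setting where the "state" is the current message carrier, the "actions" are the candidate next-hop neighbours, the two estimators are mixed with a fixed weight BETA, and the reward is an illustrative convex combination of a candidate's delivery probability and its normalised encounter frequency. All names, weights, and parameter values below are assumptions introduced for illustration.

```python
import random
from collections import defaultdict

ALPHA = 0.1    # learning rate (assumed value)
GAMMA = 0.9    # discount factor (assumed value)
BETA = 0.5     # mixing weight between the two Q-estimators (assumed fixed here)

# Two independent Q-tables, as in double Q-learning.
Q_A = defaultdict(float)
Q_B = defaultdict(float)


def reward(delivery_prob, encounter_freq, w=0.5):
    """Hypothetical probability-based reward: a weighted combination of the
    candidate node's delivery probability and its normalised encounter
    frequency with the destination (weights are assumptions)."""
    return w * delivery_prob + (1.0 - w) * encounter_freq


def select_next_hop(carrier, neighbours):
    """Greedy next-hop choice using the weighted combination of both
    estimators, which is what counters single-estimator bias."""
    return max(
        neighbours,
        key=lambda n: BETA * Q_A[(carrier, n)] + (1 - BETA) * Q_B[(carrier, n)],
    )


def update(carrier, next_hop, r, next_neighbours):
    """One weighted double Q-learning update: the table to update is picked
    at random, and the target mixes both tables at the greedy action."""
    if random.random() < 0.5:
        primary, other = Q_A, Q_B
    else:
        primary, other = Q_B, Q_A
    if next_neighbours:
        a_star = max(next_neighbours, key=lambda n: primary[(next_hop, n)])
        target = r + GAMMA * (
            BETA * primary[(next_hop, a_star)]
            + (1 - BETA) * other[(next_hop, a_star)]
        )
    else:
        target = r
    primary[(carrier, next_hop)] += ALPHA * (target - primary[(carrier, next_hop)])
```

Updating a randomly chosen table while mixing both at the greedy action keeps the two estimators partially decoupled, which is how a weighted double estimator trades off the overestimation of standard Q-learning against the underestimation of plain double Q-learning; the paper's exact weighting rule may differ from the fixed BETA used in this sketch.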