DC analysis is essential yet still challenging in large-scale nonlinear circuit simulation. Pseudo transient analysis (PTA) is a widely used solver with great potential in industry. However, PTA convergence and simulation efficiency are still seriously limited by its stepping policy. This paper proposes an online stochastic stepping policy (OSSP) for PTA based on deep reinforcement learning (DRL). To achieve better policy evaluation and stronger stepping exploration, dual soft Actor-Critic agents are combined with the proposed valuation splitting and online momental scaling, enabling OSSP to intelligently encode the PTA iteration status and further adjust the forward and backward time-step sizes online for unseen test circuits, without human intervention or domain knowledge, trained solely by RL from self-search. A public sample buffer and priority sampling are also introduced to overcome the sparsity and imbalance of the sample data. Numerical examples demonstrate that, in just one stepping iteration, the proposed OSSP achieves a significant efficiency speedup (up to 47.0X fewer NR iterations) and improved convergence on unseen test circuits compared with previous iteration-count-based and SER-based stepping methods.
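The stepping mechanism summarized above can be pictured with a minimal, hypothetical sketch (not the authors' implementation): a PTA outer loop that, after each pseudo-transient step, queries a learned stochastic policy for a forward/backward step-scaling factor based on simple iteration features. The names `StepPolicy`, `pta_features`, and `solve_step`, as well as the placeholder Gaussian policy, are assumptions for illustration only; a real OSSP actor would be a trained soft Actor-Critic network.

```python
import math
import random


def pta_features(nr_iters, change, prev_step):
    """Hypothetical iteration features the policy might observe:
    NR iteration count of the last step, log of the solution change,
    and log of the previous pseudo time-step size."""
    return (nr_iters, math.log10(max(change, 1e-30)), math.log10(prev_step))


class StepPolicy:
    """Placeholder for a trained stochastic (SAC-style) actor.

    A real OSSP actor would map the feature vector to the parameters of a
    distribution over step-scaling actions; here a fixed heuristic Gaussian
    is used purely so the sketch runs end to end.
    """

    def sample_scale(self, feats):
        nr_iters, _, _ = feats
        # Grow the pseudo-transient step when NR converged quickly,
        # shrink it (step backward) when NR struggled.
        mean = 1.5 if nr_iters < 10 else 0.5
        return max(random.gauss(mean, 0.1), 0.1)


def pta_outer_loop(solve_step, policy, h0=1e-6, max_steps=200):
    """Sketch of a PTA outer loop that delegates step-size control to the policy.

    `solve_step(h)` is assumed to perform one pseudo-transient step of size `h`
    with Newton-Raphson and return (nr_iters, solution_change).
    """
    h = h0
    for _ in range(max_steps):
        nr_iters, dx = solve_step(h)
        if dx < 1e-9:  # solution no longer changes -> DC operating point reached
            return h
        h *= policy.sample_scale(pta_features(nr_iters, dx, h))
    return h
```

In this reading, the DRL agent replaces hand-tuned step-size heuristics: the same loop structure is kept, but the scaling decision at each pseudo time point is sampled from the learned policy rather than computed by an iteration-count or SER rule.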