Per-flow spread measurement in high-speed networks has many practical applications. It is a more difficult problem than the traditional per-flow size measurement. Most prior work is based on sketches, focusing on reducing their space requirements in order to fit in on-chip cache memory. This design allows the measurement to be performed at the line rate, but it suffers from expensive computation for spread queries (unsuitable for online operations) and large errors in spread estimation for small flows. This paper complements the prior art with a new spread estimator design based on an on-chip/off-chip model. By storing traffic statistics in off-chip memory, our new design faces a key technical challenge to design an efficient online module of non-duplicate sampling that cuts down the off-chip memory access. We first propose a two-stage solution for non-duplicate sampling, which is efficient but cannot handle well a sampling probability that is either too small or too big. We then address this limitation through a three-stage solution that is more space-efficient. Our analysis shows that the proposed spread estimator is highly configurable for a variety of probabilistic performance guarantees. We implement our spread estimator in hardware using FPGA. The experiment results based on real Internet traffic traces show that our estimator produces spread estimation with much better accuracy than the prior art, reducing the mean relative (absolute) error by about one order of magnitude. Moreover, it increases the query throughput by around three orders of magnitude, making it suitable for supporting online queries in real time.