Data prefetching, which intelligently loads data closer to the processor before it is demanded, is a popular cache performance optimization technique for addressing the widening processor-memory performance gap. Although prefetching concepts have been studied for decades, sophisticated system architectures and emerging applications introduce new challenges. Large instruction windows coupled with out-of-order execution distort the program's data access sequence as seen from the cache. Furthermore, big data applications stress memory subsystems heavily with their large working-set sizes and complex data access patterns. To address these challenges, this work proposes a high-performance hardware prefetching scheme, SelSMaP. SelSMaP detects both regular and nonuniform stride patterns by taking the minimum observed address offset (called a reference stride) as a heuristic. A stride masking is generated according to the reference stride and is used to filter out history accesses whose pattern can be rephrased as uniform stride accesses. The prefetching decision and prefetch degree are determined from the masking outcome. Because SelSMaP's prediction logic relies on neither the chronological order of data accesses nor program counter information, it can unveil access patterns obscured by out-of-order execution and compiler optimization. We evaluated SelSMaP with CloudSuite workloads and SPEC CPU2006 benchmarks. SelSMaP achieves an average CloudSuite performance improvement of 30% over a nonprefetching system. With one to two orders of magnitude less storage and far less functional logic, SelSMaP outperforms the highest-performing prefetcher by 8.6% on CloudSuite workloads.
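To make the reference-stride and stride-masking idea concrete, the sketch below shows one possible software interpretation of the heuristic described above; the region size, trigger handling, coverage test, and degree cap are illustrative assumptions and not the paper's exact microarchitecture.

```cpp
#include <bitset>
#include <cstdint>
#include <iostream>
#include <vector>

// Illustrative sketch only (assumed parameters, not the paper's design):
// a 64-line spatial region per tracking entry, offsets measured in cache
// lines from the trigger access, and a fixed maximum prefetch degree.
constexpr int kRegionLines = 64;
constexpr int kMaxDegree   = 4;

struct RegionEntry {
    uint64_t base_line;                 // cache-line address of the trigger access
    std::bitset<kRegionLines> history;  // observed line offsets within the region
};

// Record an access; if the history is explainable as a uniform stride,
// return the cache-line addresses to prefetch.
std::vector<uint64_t> on_access(RegionEntry& e, uint64_t line_addr) {
    int64_t off = static_cast<int64_t>(line_addr - e.base_line);
    if (off < 0 || off >= kRegionLines) return {};   // outside tracked region
    e.history.set(static_cast<size_t>(off));

    // Reference stride: the minimum positive offset observed so far.
    int ref_stride = 0;
    for (int i = 1; i < kRegionLines; ++i)
        if (e.history.test(i)) { ref_stride = i; break; }
    if (ref_stride == 0) return {};                  // only the trigger access so far

    // Stride masking: keep offsets that fall on multiples of the reference
    // stride; any leftover access contradicts a uniform-stride pattern.
    std::bitset<kRegionLines> mask;
    for (int i = 0; i < kRegionLines; i += ref_stride) mask.set(i);
    if ((e.history & ~mask).any()) return {};        // pattern not uniform stride

    // Prefetch beyond the furthest observed offset, degree capped.
    int furthest = 0;
    for (int i = kRegionLines - 1; i >= 0; --i)
        if (e.history.test(i)) { furthest = i; break; }
    std::vector<uint64_t> prefetches;
    for (int d = 1; d <= kMaxDegree; ++d) {
        int target = furthest + d * ref_stride;
        if (target >= kRegionLines) break;
        prefetches.push_back(e.base_line + target);
    }
    return prefetches;
}

int main() {
    RegionEntry e{1000, {}};
    e.history.set(0);                                // trigger access at offset 0
    for (uint64_t a : {1003u, 1006u, 1009u})         // stride-3 access stream
        for (uint64_t p : on_access(e, a))
            std::cout << "prefetch line " << p << '\n';
}
```

Note that the decision uses only the set of observed offsets, not the order in which they arrived or the requesting program counter, which mirrors why the scheme is insensitive to reordering by out-of-order execution and compiler optimization.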