From partial differential equations, to convolutional neural networks in deep learning, to matrix operations in dense linear algebra, computations on structured grids dominate high-performance computing and machine learning. The performance of such computations is key to the effective utilization of the billions of US dollars' worth of GPU-accelerated systems on which they run. Concurrently, the end of Moore's law and of Dennard scaling is driving the specialization of compute and memory architectures. This specialization often makes performance brittle (small changes in function can have severe ramifications for performance), non-portable (vendors are increasingly motivated to develop programming models tailored to their specialized architectures), and not performance-portable (even a single computation may perform very differently from one architecture to the next). The mismatch between computations that reference data that is logically neighboring in N-dimensional space but physically distant in memory motivated the creation of Bricks: a novel data-structure transformation for multi-dimensional structured grids that reorders data into small, fixed-size bricks of contiguously packed data. Whereas a cache line naturally captures spatial locality in only one dimension of a structured grid, Bricks can capture spatial locality in three or more dimensions. When coupled with a Python interface, a code generator, and autotuning, the resultant BrickLib software provides not only raw performance, but also performance portability across multiple CPUs and GPUs, scalability in distributed memory, user productivity, and generality across computational domains. In this paper, we provide an overview of BrickLib and present a series of vignettes on how it delivers on the aforementioned metrics.
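To make the layout idea concrete, the following is a minimal sketch (not BrickLib's actual API; the class, names, and fixed 4x4x4 brick size are assumptions for illustration) of how a 3D grid can be stored as contiguously packed bricks. The index computation first selects a brick, then an offset within it, so grid points that neighbor each other in x, y, or z usually fall within the same small contiguous block of memory:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical bricked layout (illustration only): a 3D grid stored as
// 4x4x4 bricks, with each 64-element brick contiguous in memory. A stencil
// touching neighbors in x, y, and z then mostly stays within one brick,
// rather than striding across entire rows or planes of the grid.
constexpr std::size_t B = 4; // brick edge length (assumed for this sketch)

struct BrickedGrid {
    std::size_t nx, ny, nz;   // grid extents, assumed multiples of B
    std::vector<double> data; // bricks laid out back to back

    BrickedGrid(std::size_t nx_, std::size_t ny_, std::size_t nz_)
        : nx(nx_), ny(ny_), nz(nz_), data(nx_ * ny_ * nz_) {}

    // Map (i, j, k) to a flat index: pick the brick, then the offset
    // inside it. Both parts are contiguous runs of memory.
    std::size_t index(std::size_t i, std::size_t j, std::size_t k) const {
        std::size_t bi = i / B, bj = j / B, bk = k / B; // brick coordinates
        std::size_t oi = i % B, oj = j % B, ok = k % B; // intra-brick offset
        std::size_t brick = (bk * (ny / B) + bj) * (nx / B) + bi;
        return brick * (B * B * B) + (ok * B + oj) * B + oi;
    }

    double& at(std::size_t i, std::size_t j, std::size_t k) {
        return data[index(i, j, k)];
    }
};
```

In a conventional row-major layout of an nx-by-ny-by-nz grid, z-neighbors sit nx*ny elements apart; in this bricked layout they are at most B*B elements apart whenever they share a brick, which is the multi-dimensional spatial locality the paragraph above describes.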