The continuing rise of DNN usage in distributed and embedded applications demands more efficient hardware execution in the field. Low-precision GeMMs with optimized data formats play a key role in making networks more memory- and compute-efficient. A recent trend, driven by tight HW-SW co-optimization, is block-scaled representations, which compress network size by sharing exponents per data block. Prior work mostly deploys such block-scaled GeMM operations on domain-specific accelerators for maximum efficiency, at the cost of flexibility and ease of deployment. In this work, we exploit and optimize the deployment of block-scaled GeMMs on fully programmable in-order vector processors using ARM SVE. We define a systematic design-space-exploration methodology that optimally matches workload specifications with processor vector lengths, microkernel variants, and block sizes and shapes. We introduce efficient intrinsics-based microkernels with effective loop unrolling, as well as data-transfer-efficient fused requantization strategies, to maximize kernel performance while supporting several deployment configurations. Tunable block sizes and shapes enable generalized block-scaled kernel deployments, accommodating different accuracy-speed trade-off requirements. Using 2D activation blocks instead of conventional 1D blocks, the static and dynamic BS-INT8 configurations yield average speedups of 3.8x and 2.9x over FP32 models, respectively, with no accuracy loss on CNN classification tasks on the CIFAR-10/100 datasets.
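To make the block-scaled idea concrete, the sketch below quantizes a matrix into INT8 values where each 2D block shares a single power-of-two exponent, and dequantizes it back. This is only an illustrative NumPy model of the data format, not the paper's SVE implementation; the function names, the default 4x4 block shape, and the symmetric INT8 range are assumptions.

```python
import numpy as np

def quantize_block_scaled(x, block_shape=(4, 4)):
    """Illustrative block-scaled INT8 quantization: each 2D block of
    `block_shape` elements shares one power-of-two exponent (the shared
    scale). Assumes x.shape is divisible by block_shape."""
    qmax = 127  # symmetric INT8 range [-127, 127]
    br, bc = block_shape
    rows, cols = x.shape
    q = np.zeros((rows, cols), dtype=np.int8)
    exps = np.zeros((rows // br, cols // bc), dtype=np.int32)
    for bi in range(rows // br):
        for bj in range(cols // bc):
            blk = x[bi * br:(bi + 1) * br, bj * bc:(bj + 1) * bc]
            amax = float(np.abs(blk).max())
            # smallest exponent e with amax / 2**e <= qmax
            e = 0 if amax == 0.0 else int(np.ceil(np.log2(amax / qmax)))
            exps[bi, bj] = e
            q[bi * br:(bi + 1) * br, bj * bc:(bj + 1) * bc] = np.clip(
                np.round(blk / 2.0 ** e), -qmax, qmax).astype(np.int8)
    return q, exps

def dequantize_block_scaled(q, exps, block_shape=(4, 4)):
    """Expand each block's shared exponent back to per-element scales."""
    br, bc = block_shape
    scales = np.repeat(np.repeat(2.0 ** exps, br, axis=0), bc, axis=1)
    return q.astype(np.float32) * scales
```

A 1D block variant simply uses a block shape like (1, 32); the 2D activation blocks highlighted in the results amortize one exponent over a row-column tile instead of a single row segment.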