The integration of Face Verification (FV) systems into multiple critical moments of daily life has become increasingly prevalent, raising concerns regarding the transparency and reliability of these systems. Consequently, there is a growing need for FV explainability tools to provide insights into the behavior of these systems. FV explainability tools that generate visual explanations, e.g., saliency maps, heatmaps, contour-based visualization maps, and face segmentation maps, show promise in enhancing FV transparency by highlighting the contributions of different face regions to the FV decision-making process. However, evaluating the performance of such explainability tools remains challenging due to the lack of standardized assessment metrics and protocols. In this context, this paper proposes a subjective performance assessment protocol for evaluating the explainability performance of visual explanation-based FV explainability tools through pairwise comparisons of their explanation outputs. The proposed protocol encompasses a set of key specifications designed to efficiently collect the subjects’ preferences and estimate explainability performance scores, facilitating the relative assessment of the explainability tools. This protocol aims to address the current gap in evaluating the effectiveness of visual explanation-based FV explainability tools, providing a structured approach for assessing their performance and comparing with alternative tools. The proposed protocol is exercised and validated through an experiment conducted using two distinct heatmap-based FV explainability tools, notably FV-RISE and CorrRISE, taken as examples of visual explanation-based explainability tools, considering the various types of FV decisions, i.e., True Acceptance (TA), False Acceptance (FA), True Rejection (TR), and False Rejection (FR). A group of subjects with variety in age, gender, and ethnicity was tasked to express their preferences regarding the heatmap-based explanations generated by the two selected explainability tools. The subject preferences were collected and statistically processed to derive quantifiable scores, expressing the relative explainability performance of the assessed tools. The experimental results revealed that both assessed explainability tools exhibit comparable explainability performance for FA, TR, and FR decisions with CorrRISE performing slightly better than FV-RISE for TA decisions.