Many Deaf and Hard-of-Hearing (DHH) individuals across the world rely on captioning services to access information conveyed through speech. Today, Automatic Speech Recognition (ASR) technology has the potential to replace existing human-provided captioning services due to its lower cost of operation and ever-increasing accuracy. However, as with most automatic systems, ASR technology is still imperfect, which raises issues of trust and acceptance when building a human-free communication service for these users. Thus, there is a need to evaluate the usability of these systems with DHH users before deploying them in the real world. Yet most researchers lack access to sufficient DHH users for extrinsic, empirical studies of these automatic captioning systems. This article presents our work on the development of an automatic caption quality evaluation metric, which we design and validate through studies and real-world observations with DHH users.