INTRODUCTION: As payers increasingly implement pay-for-performance measures, quality metrics must reliably reflect true differences in performance among the hospitals profiled. METHODS: We used State Inpatient Databases from nine states to characterize serious complications after elective cervical and thoracolumbar fusion. Hierarchical logistic regression was used to adjust for differences in case mix and for variability due to low case volumes. The statistical reliability of this risk-stratified complication rate (RSCR) was assessed as the amount of variation between hospitals relative to the total amount of variation for each measure, calculated separately by fusion type and year. In other words, statistical reliability reflected the proportion of observed variation between hospitals that was not due to chance alone. Finally, we estimated the proportion of hospitals that had sufficient case volumes to obtain reliable (>0.7) complication estimates. RESULTS: From 2010 to 2017, we identified 154,078 cervical and 213,133 thoracolumbar fusion surgeries. Overall, 4.2% of cervical fusion patients had a serious complication, and the median RSCR increased from 4.2% in 2010 to 5.5% in 2017. The reliability of the RSCR for cervical fusion was poor and varied substantially by year (range 0.04-0.28). Overall, 7.7% of thoracolumbar fusion patients experienced a serious complication, and the median RSCR ranged from 6.8% to 8.0% during the study period. Although still modest, the RSCR reliability was higher for thoracolumbar fusion (range 0.16-0.43). Depending on the study year, 0-4.5% of hospitals had sufficient cervical fusion case volume to report reliable (>0.7) estimates, whereas 15-36% of hospitals reached this threshold for thoracolumbar fusion. CONCLUSION: A metric of serious complications was unreliable for benchmarking cervical fusion outcomes and only modestly reliable for thoracolumbar fusion.
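The reliability definition in METHODS (between-hospital variation relative to total variation) can be illustrated with a minimal sketch. It assumes the common signal-to-noise form, in which the noise term is a within-hospital variance divided by the hospital's case volume; the abstract does not state the exact estimator used, so this form, the function names, and the variance values below are illustrative assumptions and not the study's method or data.

```python
import math


def reliability(between_var: float, within_var: float, n_cases: int) -> float:
    """Signal-to-noise reliability for one hospital with n_cases cases.

    Assumed form: between / (between + within / n). This is a common
    operationalization, not necessarily the estimator used in the study.
    """
    if between_var <= 0 or within_var <= 0:
        raise ValueError("variance components must be positive")
    if n_cases <= 0:
        raise ValueError("n_cases must be positive")
    noise = within_var / n_cases  # sampling variance of the hospital's estimate
    return between_var / (between_var + noise)


def min_cases_for_reliability(
    between_var: float, within_var: float, target: float = 0.7
) -> int:
    """Smallest case volume at which reliability reaches the target.

    Obtained by solving between / (between + within / n) >= target for n.
    """
    if between_var <= 0 or within_var <= 0:
        raise ValueError("variance components must be positive")
    if not 0 < target < 1:
        raise ValueError("target must be strictly between 0 and 1")
    return math.ceil(target / (1 - target) * within_var / between_var)


# Hypothetical variance components chosen only for illustration.
between_var, within_var = 1.0, 10.0
for n in (5, 20, 50):
    print(n, round(reliability(between_var, within_var, n), 2))
print(min_cases_for_reliability(between_var, within_var))
```

Under this form, reliability rises with case volume and with the size of true between-hospital differences, which is consistent with the abstract's finding that only hospitals with sufficient case volume reach the 0.7 threshold.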