Ground delay programs (GDPs) are frequently used to keep U.S. air transportation safe and efficient. Most research on GDPs has focused on optimal design and implementation, with little attention given to retrospective performance evaluation. This research fills that gap by identifying GDP performance criteria, developing associated performance metrics, and evaluating GDP performance across airports and over time. Metrics are specified for five performance goals: capacity utilization, efficiency, predictability, equity, and flexibility. By defining multiple performance metrics, this research enables FAA traffic managers and flight operators to review GDP performance comprehensively after the fact and to uncover performance trends across airports and over time. With data from the FAA Aggregate Demand List and the FAA Aviation System Performance Metrics, historical GDP performance is assessed for San Francisco International Airport (SFO) in California and Newark Liberty International Airport (EWR) in New Jersey for 2006 and 2011. For both airports, capacity utilization and efficiency scores are high, on average, and reflect the importance that FAA and the flight operator community attach to making effective use of available capacity and keeping air transport efficient and safe. In contrast, predictability performance is weaker and more variable. Lack of consensus on how predictability should be measured or valued may have diminished the weight given to this goal in GDP decision making. On average, SFO GDPs have higher capacity utilization and predictability, whereas EWR GDPs are more efficient, equitable, and flexible. A comparison of results for 2006 and 2011 shows that, in 2011, GDPs were more predictable but used capacity less effectively than in 2006.