Abstract

Electricity usage (demand) data are used by utilities, governments, and academics to model electric grids for a variety of planning purposes (e.g., capacity expansion and system operation). The U.S. Energy Information Administration collects hourly demand data from all balancing authorities (BAs) in the contiguous United States. As of September 2019, we find that 2.2% of the demand data in their database are missing. Additionally, 0.5% of reported quantities are either negative or otherwise identified as outliers. With the goal of attaining complete, continuous, and physically plausible demand data to facilitate analysis, we developed a screening process to identify anomalous values. We then applied a Multiple Imputation by Chained Equations (MICE) technique to impute replacements for missing and anomalous values. We cross-validated the MICE technique by marking subsets of plausible data as missing and using the remaining data to predict these “missing” values. The mean absolute percentage error of imputed values is 3.5% across all BAs. The cleaned data are published open access: https://doi.org/10.5281/zenodo.3690240.
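The cross-validation idea described above (hide plausible values, impute them, score with mean absolute percentage error) can be sketched as follows. This is a minimal illustration only: the imputer here is plain linear interpolation, a stand-in that is not MICE, and the function names, the 10% hold-out fraction, and the random seed are assumptions made for the example rather than details from the paper.

```python
import random


def mape(actual, predicted):
    """Mean absolute percentage error, in percent, over paired values."""
    total = sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted))
    return 100.0 * total / len(actual)


def interpolate_gaps(series):
    """Stand-in imputer (NOT MICE): fill None gaps by linear interpolation
    between the nearest known neighbors; edge gaps copy the nearest known value."""
    known = [i for i, v in enumerate(series) if v is not None]
    filled = list(series)
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            w = (i - left) / (right - left)
            filled[i] = series[left] + w * (series[right] - series[left])
    return filled


def holdout_mape(series, fraction=0.1, seed=0):
    """Hide a random subset of known values, impute them, and score the result."""
    rng = random.Random(seed)
    n_hidden = max(1, int(len(series) * fraction))
    hidden = sorted(rng.sample(range(len(series)), n_hidden))
    masked = list(series)
    for i in hidden:
        masked[i] = None  # treat plausible data as "missing"
    imputed = interpolate_gaps(masked)
    return mape([series[i] for i in hidden], [imputed[i] for i in hidden])
```

Calling `holdout_mape(hourly_demand)` on a complete list of positive hourly values returns the error, in percent, of the stand-in imputer on the hidden hours; swapping in a different imputer only requires replacing `interpolate_gaps`.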

Highlights

  • Electricity system models typically require electricity usage as a known input

  • We find that as of 10 September 2019, 2.2% of the hourly data in the Energy Information Administration (EIA)’s database are missing, and another 0.5% are either physically implausible or suspicious for other reasons as outlined in the Methods section

  • The EIA classifies an hourly demand value as anomalous and marks it for imputation if the value is “missing or reported as negative, zero, or at least 1.5 times greater than the maximum of past total demand values reported by that balancing authority (BA)” [3]
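The EIA rule quoted in the last highlight can be read as a simple per-hour check. The sketch below is one possible interpretation, not the EIA's implementation: the function name is hypothetical, missing values are represented as `None`, and excluding already-flagged values from the running maximum is an assumption the quoted rule does not settle.

```python
def flag_eia_style(demand, factor=1.5):
    """Return one boolean per hourly value: True if the value is missing,
    negative, zero, or at least `factor` times the maximum of earlier
    accepted values for the same balancing authority."""
    flags = []
    past_max = None  # running maximum of values that were not flagged
    for value in demand:
        if value is None or value <= 0:
            flags.append(True)
            continue
        too_high = past_max is not None and value >= factor * past_max
        flags.append(too_high)
        if not too_high:
            # Assumption: flagged spikes do not raise the reference maximum.
            past_max = value if past_max is None else max(past_max, value)
    return flags
```

For example, `flag_eia_style([100, None, -5, 0, 120, 200, 150])` flags the missing, negative, and zero entries, and also flags 200 because it is at least 1.5 times the prior accepted maximum of 120.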



Background & Summary

Electricity system models typically require electricity usage (demand) as a known input. The EIA publishes demand data via an open access data portal: https://www.eia.gov/opendata/qb.php?category=2122628. Many BAs are missing hundreds to thousands of hours of demand data, which limits the usefulness of these data for energy system modelers. The EIA uses a nearest neighbor method in which values from the prior hour and day are used to replace missing or anomalous data. In this Data Descriptor, we present a method to create complete time series data sets from a set of correlated demand records. The algorithms were designed to incorporate the time series structure of the data and often use excessive deviations from continuity as a reason to flag a value. We aim to provide updated, cleaned data every 12 months to include recently published demand data and incorporate any available corrections to historical data.
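To illustrate what flagging "excessive deviations from continuity" might look like, the sketch below compares each hourly value with the median of its neighboring hours. It is a hypothetical example under stated assumptions, not the screening algorithm from the Methods section: the function name, the two-hour half window, and the 50% relative-deviation threshold are all invented for illustration.

```python
from statistics import median


def flag_discontinuities(demand, half_window=2, max_rel_dev=0.5):
    """Flag values that differ from the median of nearby hours by more
    than `max_rel_dev` (as a fraction of that median). Missing values
    (None) are skipped here and left to a separate missing-data check."""
    flags = []
    for i, value in enumerate(demand):
        lo = max(0, i - half_window)
        hi = min(len(demand), i + half_window + 1)
        neighbors = [demand[j] for j in range(lo, hi)
                     if j != i and demand[j] is not None]
        if value is None or not neighbors:
            flags.append(False)
            continue
        ref = median(neighbors)
        flags.append(ref > 0 and abs(value - ref) / ref > max_rel_dev)
    return flags
```

With `demand = [100, 102, 101, 500, 103, 99, 100]`, only the 500 entry is flagged, since its neighbors have a median near 101. A window this short can misfire at the edges of a series or across genuine ramps, so any real threshold would need tuning against the data.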

