The SEC's EDGAR log file data set is a collection of web server log files that allow researchers to study the demand for SEC filings. This multi-terabyte data set provides researchers with a direct measure of demand for financial reports, but the log files must be filtered to remove downloads by computer programs (robots), and the sheer size of the files presents big data challenges. This paper compares three methods for counting human views in the EDGAR log files and aggregates the data on a filing-day basis so that it can be analyzed on desktop hardware with standard statistical software. Overall, the three methods agree on the robot-human classification for 96 percent of users, but for sample 10-K filings, they can disagree by up to 27 percent. Download counts may be biased by up to 36 percent if multiple views by the same user are counted. Ryans's 2017 method eliminates multiple download counting and appears to classify robots effectively in cases where the measures disagree. The choice of measure may be particularly important when studying demand for Forms 10-K, 10-Q, 4, and 13F-HR, as well as SEC comment letters. The aggregated data and sample code are available from the author.
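To illustrate the filing-day aggregation and the effect of counting repeated views, the following is a minimal sketch, not the paper's code. It assumes each log record has already been filtered for robots and reduced to three fields named `ip`, `date`, and `accession`; the function name `filing_day_views` and the sample rows are hypothetical. It reports, for each filing-day, the raw number of requests alongside the number of distinct users.

```python
from collections import defaultdict


def filing_day_views(rows):
    """Aggregate log records to one entry per (accession, date).

    rows: iterable of dicts with keys 'ip', 'date', 'accession'
    (assumed field names; robot filtering is assumed done upstream).
    """
    raw = defaultdict(int)      # every request, repeats included
    users = defaultdict(set)    # distinct requesters per filing-day
    for r in rows:
        key = (r["accession"], r["date"])
        raw[key] += 1
        users[key].add(r["ip"])
    return {k: {"raw": raw[k], "unique": len(users[k])} for k in raw}


# Hypothetical records: the first user requests the same filing twice.
sample = [
    {"ip": "10.0.0.aaa", "date": "2017-01-03", "accession": "0000000000-17-000001"},
    {"ip": "10.0.0.aaa", "date": "2017-01-03", "accession": "0000000000-17-000001"},
    {"ip": "10.0.1.bbb", "date": "2017-01-03", "accession": "0000000000-17-000001"},
    {"ip": "10.0.1.bbb", "date": "2017-01-04", "accession": "0000000000-17-000001"},
]

counts = filing_day_views(sample)
for (accession, date), c in sorted(counts.items()):
    print(accession, date, c["raw"], c["unique"])
```

Comparing the `raw` and `unique` columns for a filing-day shows how far repeated views by the same user inflate a simple download count.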