Background and objectives
Ambulatory blood pressure monitoring (ABPM) is usually reported as descriptive values such as circadian averages and standard deviations. Using the original, individual blood pressure measurements may be advantageous, particularly for research purposes, because it increases the flexibility of the analytical process, enables alternative statistical analyses and may provide novel insights. Here we describe the development of a new multistep, hierarchical data extraction algorithm to collect raw data from .pdf reports and text files as part of a large multi-center clinical study.

Methods
Original reports were saved in a nested file system, from which they were automatically extracted, read and saved into databases with custom-made programs written in Python 3. The data were then processed and cleaned, and relevant descriptive statistics such as averages and standard deviations were calculated according to a variety of definitions of day- and night-time. Additionally, data control mechanisms for manual review of the data and programmatic auto-detection of extraction errors were implemented as part of the project.

Results
The developed algorithm extracted 97% of the data automatically. The missing data consisted mostly of reports that were saved incorrectly or not formatted in the specified way. Manual checks comparing samples of the extracted data to the original reports indicated a high level of accuracy, and no errors attributable to flaws in the extraction software were detected in the extracted dataset.

Conclusions
The developed multistep, hierarchical data extraction algorithm facilitated collection from different file formats. Paired with database cleaning and data processing steps, it led to an effective and accurate assembly of raw ABPM data for further, adjustable analyses. Manual work was minimized while data quality was ensured with standardized, reproducible procedures.
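
As an illustration of the Methods step that calculates averages and standard deviations under several definitions of day- and night-time, the following is a minimal sketch rather than the study's actual code. The window names and boundaries in `DAY_WINDOWS`, the `summarize` function and the `(timestamp, systolic)` input layout are all assumptions made for this example.

```python
from datetime import datetime, time
from statistics import mean, stdev

# Hypothetical day-time windows (start inclusive, end exclusive).
# Readings outside the window are treated as night-time.
DAY_WINDOWS = {
    "fixed_06_22": (time(6, 0), time(22, 0)),
    "narrow_09_21": (time(9, 0), time(21, 0)),
}


def is_day(clock_time, window):
    """Return True if clock_time falls inside the day-time window."""
    start, end = window
    return start <= clock_time < end


def summarize(readings, window):
    """Group raw readings into day and night and compute n, mean and SD.

    readings: list of (datetime, systolic_mmHg) tuples (assumed layout).
    """
    groups = {"day": [], "night": []}
    for timestamp, value in readings:
        period = "day" if is_day(timestamp.time(), window) else "night"
        groups[period].append(value)

    summary = {}
    for period, values in groups.items():
        summary[period] = {
            "n": len(values),
            "mean": mean(values) if values else None,
            # The sample SD needs at least two readings.
            "sd": stdev(values) if len(values) > 1 else None,
        }
    return summary
```

Because the raw readings are retained, the same data can be re-summarized under any window by calling `summarize(readings, DAY_WINDOWS[name])` for each definition of interest. That adjustability is what stored circadian averages alone would not allow.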