Abstract

Laboratory testing is integral to the diagnosis and treatment of disease. Advances in electronic medical records and computation have initiated an era of ‘big data’ for laboratory medicine. However, most health data continue to exist in isolated and inaccessible information silos, limiting public utility. To address this gap, we have developed a large, de-identified, longitudinal dataset of laboratory results from over one million patients, called the Million Patient Labset (MPL), which is free and open to the public. The MPL was produced by combining data from two large healthcare systems, with IRB and information security approvals. It includes 1,112,550 patients from 1,920 healthcare sites in two states from 2017 through 2021, with 4,404 different clinical assays represented. Sites included primary care clinics; secondary care facilities for dialysis, infusions, rehabilitation, and other subspecialties; and tertiary centers for major surgery, trauma care, cancer care, and other complex medical services. Assays having fewer than 100 results were excluded. Limited patient demographic and encounter metadata are also included. Protected health information was de-identified in accordance with the Health Insurance Portability and Accountability Act of 1996 (HIPAA) safe harbor standards. Service dates and times were altered to obscure absolute timestamps while approximately preserving the interval lengths between test results of the same patient, as well as general averages of age and time of testing across patients. Time of day, day of week, and seasonality were approximately preserved. Data were validated through randomly selected comparisons against the original patient records prior to destroying re-identification keys. In conclusion, we have produced a large, de-identified, open-access dataset of clinical laboratory results from over one million real patients. We hope this resource will be valuable for researchers exploring human clinical variation, medical artificial intelligence, reference interval optimization, and other data-driven studies.
