Abstract Introduction: Despite the decline in overall cancer mortality and incidence in the US population, substantial disparities in cancer care persist within certain groups. The proportion of racial and ethnic minorities recruited to participate in cancer research is persistently lower than their share of the US population. The rapid adoption of electronic health record (EHR) systems has enabled computational cohort identification and eligibility screening to facilitate cancer research. However, EHR data are known to suffer from data quality issues arising from numerous factors, such as care access. Patients with limited care access due to low socioeconomic status would naturally be sparsely represented in EHR data relative to those of higher status. Prospective enrollees may be disqualified at the screening stage because of biased or incomplete EHR data. Methods: To systematically identify data quality variabilities in EHRs, we implemented an information score (i-score) to reflect the density, variability, and irregularity of patients’ data. This pattern can be measured by estimating the variability of the time gaps between observations. For a patient observed n times, the observation times are represented as x(1)…x(n). The relative time interval g(i) between consecutive observations is defined as g(i) = [x(i+1) - x(i)]/[x(n) - x(1)], for i = 1…n-1. The average amount of information of each observation is then defined as I, between 0 and 1, calculated as: I = 2/n + [(n-2)/n] × [1 - sqrt((n-1) × Var{g(i); i = 1…n-1})]. Equally spaced observations receive a high i-score. The lower the i-score, the higher the data irregularity. We conducted a case study in a cohort of de novo stage IV breast cancer (DNIV) assembled by a previously validated phenotyping algorithm (Wang AACR 2020). The cohort contains 1,918 DNIV cases between 2004 and 2018 at Mayo Clinic Rochester. 
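The i-score definition in the Methods can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function name `i_score`, the use of plain numbers (for example, days since the first record) as observation times, and the choice of the sample variance for Var are all assumptions, since the abstract does not specify them.

```python
import math
import statistics


def i_score(times):
    """Illustrative i-score for one patient's observation times.

    `times` holds numeric observation times, e.g. days since first record
    (an assumption; the abstract does not fix a time unit).
    """
    xs = sorted(times)
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two observations")
    span = xs[-1] - xs[0]
    if span <= 0:
        raise ValueError("observations must span a nonzero time range")
    if n == 2:
        # (n - 2) / n is 0, so the formula reduces to 2 / n = 1.
        return 1.0
    # Relative gaps g(i) = [x(i+1) - x(i)] / [x(n) - x(1)]; they sum to 1.
    gaps = [(b - a) / span for a, b in zip(xs, xs[1:])]
    # Assumption: sample variance (divides by the number of gaps minus 1).
    var = statistics.variance(gaps)
    # max(0.0, ...) guards against tiny negative values from rounding.
    regularity = 1.0 - math.sqrt(max(0.0, (n - 1) * var))
    return 2.0 / n + (n - 2) / n * regularity
```

Under these assumptions, four visits on days 0, 10, 20, and 30 score 1, while visits on days 0, 1, 2, and 30 score about 0.55, reflecting the long unobserved gap.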
We used the i-score to assess patient visits, breast cancer diagnosis codes, and narrative documentation, with stage, surgery, de novo, and metastasis information extracted according to the phenotyping definitions. The summary statistics of the i-scores were compared among four racial groups: African American (3%), Asian (1%), Caucasian (87%), and Other (9%). An ANOVA F test was used to compare the groups. Results and Conclusion: The mean i-scores for visit, diagnosis, and narrative documentation were 0.712, 0.200, and 0.056 for African American, 0.412, 0.301, and 0.073 for Asian, 0.438, 0.278, and 0.056 for Caucasian, and 0.519, 0.343, and 0.043 for the Other category, respectively. The F test indicated significant differences among groups for visit (p<0.001) and diagnosis (p=0.0267) data. No difference was found in narrative documentation (p=0.8591). In conclusion, we observed substantial i-score variation, especially within the African American group, suggesting a strong need to address EHR data quality disparities for cancer research. Citation Format: Sunyang Fu, Liwei Wang, Folakemi T. Odedina, Hongfang Liu. Assessment of EHR data quality variabilities among different racial groups in the cohort of de novo stage IV breast cancer [abstract]. In: Proceedings of the 15th AACR Conference on the Science of Cancer Health Disparities in Racial/Ethnic Minorities and the Medically Underserved; 2022 Sep 16-19; Philadelphia, PA. Philadelphia (PA): AACR; Cancer Epidemiol Biomarkers Prev 2022;31(1 Suppl):Abstract nr A020.