Data Science and Computational Linguistics on a Collection of Interviews with Family Caregivers in Heart Failure

Soyoung Choi, Lisa Kitko, Judith E Hupcey, Barbara Birriel, Suhang Wang

doi:10.1016/j.cardfail.2020.09.259

Abstract

Introduction: Data science techniques and data-driven research designs are essential in the analysis of big data. Recognizing this contemporary trend in scientific and technical work, this study employed text mining techniques to analyze several hundred serial interview scripts obtained from the family caregivers of individuals with heart failure (hereafter, HF). To fill the knowledge gap in longitudinal observations that capture the attributes of family caregiving over the unpredictable HF trajectory, this study investigated (1) the dominant topics of family caregiving experiences, as estimated by latent Dirichlet allocation (LDA) topic modelling; and (2) the distribution of positive and negative words used by family caregivers, as uncovered by lexicon-based sentiment analysis across all of the interview scripts.

Methods: R statistical software was used for this study. This text-as-data research method covered a total of 721 interview scripts (i.e., 721 .docx files) and followed the five steps of the Knowledge Discovery in Textual Databases (KDT) model. This model centers on the process of extracting meaningful, non-trivial patterns or knowledge from a set of unstructured texts. To avoid noise in the topic modelling and sentiment analyses, all punctuation, special characters, and stop words that carry little meaning (e.g., “a,” “an,” “the,” “and,” “it,” “they”) were removed. Next, each word was converted to its stem by applying the Porter stemming algorithm. Finally, the “topicmodels” and “sentimentr” R packages were used for text mining.

Results: The total number of words after text preprocessing was 65,620. LDA topic modelling identified a model with five latent topics (k = 5) as the most interpretable across all of the interview scripts (Figure 1). The interpreted topics were as follows: (1) facing the loss of a sick family member; (2) interacting with a healthcare provider; (3) juggling multiple roles; (4) changing medical treatments and the impact on HF symptoms; and (5) formulating caregiving routines in daily life. The top 20 most common positive and negative words were visualized using a word cloud (Figure 2), and the total numbers of negative terms (n = 1,161, 1.77%) and positive terms (n = 793, 1.20%) were counted.

Conclusion: The data-driven approach enabled an unbiased text analysis of serial interview scripts in a time-efficient and scalable manner. Text mining can evidently play an assistive role in discovering hidden meanings and capturing vital scientific insights when coupled with traditional qualitative research.
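
To make the preprocessing and lexicon-based counting steps concrete, the following is a minimal Python sketch, not the authors' R pipeline. It assumes a toy stop-word list and a tiny hypothetical sentiment lexicon (the study used the "sentimentr" package, whose lexicon is far larger), and it omits Porter stemming to stay dependency-free. The function names `preprocess` and `sentiment_counts` are illustrative only.

```python
import re
from collections import Counter

# Hypothetical toy resources for illustration; not the lexicon used in the study.
STOP_WORDS = {"a", "an", "the", "and", "it", "they"}
POSITIVE = {"good", "hope", "support"}
NEGATIVE = {"worry", "tired", "loss"}


def preprocess(text):
    """Lowercase, drop punctuation/special characters, and remove stop words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def sentiment_counts(tokens):
    """Return {label: (count, percent of all tokens)} for each polarity."""
    counts = Counter()
    for t in tokens:
        if t in POSITIVE:
            counts["positive"] += 1
        elif t in NEGATIVE:
            counts["negative"] += 1
    total = len(tokens)
    return {
        label: (counts[label], 100 * counts[label] / total if total else 0.0)
        for label in ("positive", "negative")
    }


if __name__ == "__main__":
    tokens = preprocess("They worry, and it is a loss; the support is good.")
    print(tokens)
    print(sentiment_counts(tokens))
```

The percentages are computed against the total number of tokens remaining after preprocessing, which mirrors how the abstract reports positive and negative terms as a share of the 65,620 preprocessed words.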
