Developments in natural language processing (NLP) and unsupervised machine learning methodologies (e.g., clustering) have given researchers new tools to analyze both structured and unstructured health data. We applied these methods to 2019 Ohio disease intervention specialist (DIS) syphilis records to determine whether they can uncover patterns of co-occurrence among individual characteristics, risk factors, and clinical characteristics of syphilis that are not yet reported in the literature. The 2019 DIS syphilis records (n=1,996) contain both structured data (categorical and numerical variables) and unstructured notes. In the structured data, we examined case demographics, syphilis risk factors, and clinical characteristics of syphilis. For the unstructured text, we applied TF-IDF (term frequency multiplied by inverse document frequency) weights, a common way to convert text into numerical representations. We performed agglomerative clustering with cosine similarity using the CLUTO software. The cluster analysis yielded six clusters of syphilis cases based on patterns in the structured and unstructured data. The average internal similarities were much higher than the average external similarities, indicating that the clusters were well-formed. Three of the clusters were driven by patterns of missing data. The other three were driven by sexual behaviors and partnerships. Notably, one of these three consisted of individuals who reported oral sex with male or anonymous partners while intoxicated, and another consisted mainly of males who have sex with females. Our analysis produced clusters that were mathematically well-formed but did not reveal epidemiological information about syphilis risk factors or transmission that was not already known.
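To make the text-processing steps concrete, the following is a minimal Python sketch of TF-IDF weighting, cosine similarity, and agglomerative clustering, followed by an internal versus external similarity check. It is not the CLUTO implementation used in the study. The function names, the whitespace tokenizer, the choice of average linkage, and the four example notes are all assumptions introduced for illustration; the notes are invented and are not drawn from the DIS records.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    tokenized = [doc.lower().split() for doc in docs]  # naive tokenizer (assumption)
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: (c / len(tokens)) * math.log(n_docs / df[t]) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def agglomerative(vectors, n_clusters):
    """Repeatedly merge the two most similar clusters (average linkage, an assumption)."""
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        _, i, j = max(
            (avg_sim(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


def internal_similarity(vectors, cluster):
    """Average pairwise cosine similarity among members of one cluster."""
    pairs = [(i, j) for i in cluster for j in cluster if i < j]
    if not pairs:
        return 0.0  # placeholder: the average is undefined for a single-member cluster
    return sum(cosine(vectors[i], vectors[j]) for i, j in pairs) / len(pairs)


def external_similarity(vectors, cluster):
    """Average cosine similarity between cluster members and all other cases."""
    others = [k for k in range(len(vectors)) if k not in cluster]
    if not others:
        return 0.0
    total = sum(cosine(vectors[i], vectors[k]) for i in cluster for k in others)
    return total / (len(cluster) * len(others))


# Invented example notes, for illustration only.
notes = [
    "partner anonymous oral sex",
    "anonymous partner oral sex intoxicated",
    "missing interview refused",
    "refused interview missing contact",
]
vectors = tfidf_vectors(notes)
clusters = agglomerative(vectors, n_clusters=2)
for members in clusters:
    print(members, internal_similarity(vectors, members), external_similarity(vectors, members))
```

In this toy example the two groups of notes share no terms, so each cluster's internal similarity is well above its external similarity, which is the sense in which a clustering is described as well-formed. As the abstract notes, that mathematical property alone does not imply that the clusters carry new epidemiological meaning.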