Abstract
Background: Developments in natural language processing (NLP) and unsupervised machine learning methodologies (e.g., clustering) have given researchers new tools to analyze both structured and unstructured health data. We applied these methods to 2019 Ohio disease intervention specialist (DIS) syphilis records to determine whether they can uncover patterns of co-occurrence among individual characteristics, risk factors, and clinical characteristics of syphilis that are not yet reported in the literature.

Methods: The 2019 DIS syphilis records (n=1,996) contain both structured data (categorical and numerical variables) and unstructured notes. In the structured data, we examined case demographics, syphilis risk factors, and clinical characteristics of syphilis. For the unstructured text, we applied TF-IDF (term frequency multiplied by inverse document frequency) weights, a common way to convert text into numerical representations. We performed agglomerative clustering with cosine similarity using the CLUTO software.

Results: The cluster analysis yielded six clusters of syphilis cases based on patterns in the structured and unstructured data. The average internal similarities were much higher than the average external similarities, indicating that the clusters were well-formed. The factors underlying three of the clusters related to patterns of missing data. The factors underlying the other three clusters were sexual behaviors and partnerships. Notably, one of these three consisted of individuals who reported oral sex with male or anonymous partners while intoxicated, and one consisted mainly of males who have sex with females.

Conclusions: Our analysis produced clusters that were well-formed mathematically but did not reveal novel epidemiological information about syphilis risk factors or transmission.
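The pipeline named in the Methods (TF-IDF weighting, cosine similarity, agglomerative clustering) and the internal versus external similarity check from the Results can be illustrated with a minimal Python sketch. It is not the authors' CLUTO workflow: the whitespace tokenizer, the average-linkage merge rule, the choice of two clusters, the helper names, and the four toy notes are all assumptions made for illustration, and the notes are invented strings, not DIS records.

```python
import math
from collections import Counter
from itertools import combinations


def tfidf_vectors(docs):
    """Return one sparse TF-IDF dict per document (tf * log(N / df))."""
    tokenized = [doc.lower().split() for doc in docs]  # assumed: whitespace tokens
    n_docs = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def average_similarity(a, b, sim):
    """Mean pairwise similarity between members of index lists a and b."""
    return sum(sim[i][j] for i in a for j in b) / (len(a) * len(b))


def agglomerative(vectors, k):
    """Merge the two most similar clusters (average linkage) until k remain."""
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > k:
        a, b = max(
            combinations(range(len(clusters)), 2),
            key=lambda p: average_similarity(clusters[p[0]], clusters[p[1]], sim),
        )
        clusters[a] = clusters[a] + clusters[b]  # a < b, so deleting b is safe
        del clusters[b]
    return clusters, sim


def internal_external(clusters, sim):
    """Per cluster: (mean within-cluster similarity, mean similarity to the rest)."""
    n = len(sim)
    scores = []
    for members in clusters:
        pairs = list(combinations(members, 2))
        internal = sum(sim[i][j] for i, j in pairs) / len(pairs) if pairs else 1.0
        rest = [j for j in range(n) if j not in members]
        external = average_similarity(members, rest, sim) if rest else 0.0
        scores.append((internal, external))
    return scores


# Invented toy notes for illustration only; not taken from any real record.
notes = [
    "patient reported anonymous partners",
    "anonymous partners reported by patient",
    "no interview record missing",
    "record missing no interview",
]

clusters, sim = agglomerative(tfidf_vectors(notes), k=2)
scores = internal_external(clusters, sim)
for members, (internal, external) in zip(clusters, scores):
    print(members, round(internal, 3), round(external, 3))
```

A cluster whose internal score clearly exceeds its external score is cohesive in the sense the Results describe. As the Conclusions point out, that is a statement about geometry in the feature space and does not by itself make a cluster epidemiologically informative.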