Abstract

We present a novel approach to discovering small groups of anomalously similar pieces of free text. The UK's National Reporting and Learning System (NRLS) contains free text and categorical variables describing several million patient safety incidents that have occurred in the National Health Service. The groups of interest represent previously unknown incident types. The task is particularly challenging because the free text descriptions are of random lengths, from very short to quite extensive, and include arbitrary abbreviations and misspellings, as well as technical medical terms. Incidents of the same type may also be described in various different ways. The aim of the analysis is to produce a global, numerical model of the text, such that the relative positions of the incidents in the model space reflect their meanings. A high dimensional vector space model of the text passages is produced; TF-IDF term weighting is applied, reflecting the differing importance of particular words to a description's meaning. The dimensionality of the model space is reduced, using principal component and linear discriminant analysis. The supervised analysis uses categorical variables from the NRLS, and allows incidents of similar meaning to be positioned close to one another in the model space. Anomaly detection tools are then used to find small groups of descriptions that are more similar than one would expect. The results are evaluated by having the groups assessed qualitatively by domain experts to see whether they are of substantive interest.

Full Text
Paper version not known

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.