We present a novel approach to discovering small groups of anomalously similar pieces of free text.
The UK’s National Reporting and Learning System (NRLS) contains free text and categorical variables describing several million patient safety incidents that have occurred in the National Health Service. The groups of interest represent previously unknown incident types. The task is particularly challenging because the free-text descriptions vary widely in length, from very short to quite extensive, and include arbitrary abbreviations and misspellings as well as technical medical terms. Incidents of the same type may also be described in many different ways.
The aim of the analysis is to produce a global numerical model of the text in which the relative positions of the incidents reflect their meanings. A high-dimensional vector space model of the text passages is produced, and TF-IDF term weighting is applied to reflect the differing importance of particular words to a description’s meaning. The dimensionality of the model space is then reduced using principal component analysis and linear discriminant analysis. The supervised step, linear discriminant analysis, uses categorical variables from the NRLS and allows incidents of similar meaning to be positioned close to one another in the model space. Anomaly detection tools are then used to find small groups of descriptions that are more similar to one another than one would expect. The results are evaluated qualitatively: domain experts assess the groups to judge whether they are of substantive interest.
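As an illustration of the first stage only, the following minimal sketch builds TF-IDF weighted vectors for a few short descriptions and compares them by cosine similarity. It is not the implementation used in this work: the three example descriptions are invented for illustration and are not NRLS records, the helper names (`tokenize`, `tfidf_vectors`, `cosine`) are hypothetical, and the unsmoothed weighting `tf * log(N / df)` is one common TF-IDF variant chosen as an assumption. The dimensionality reduction and anomaly detection stages are not shown.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case and split on whitespace (an assumed, deliberately naive tokenizer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document.

    Weight = raw term count * log(N / document frequency); this unsmoothed
    variant is an assumption, not necessarily the weighting used in the paper.
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)

    # Document frequency: number of documents containing each term.
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append(
            {t: c * math.log(n_docs / doc_freq[t]) for t, c in counts.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Invented example descriptions (not taken from the NRLS).
docs = [
    "patient fell from bed during night",
    "patient fell out of bed overnight",
    "wrong dose of insulin given to patient",
]

vectors = tfidf_vectors(docs)
sim_01 = cosine(vectors[0], vectors[1])  # two descriptions of a fall
sim_02 = cosine(vectors[0], vectors[2])  # a fall versus a medication error
print(f"fall vs fall:       {sim_01:.3f}")
print(f"fall vs medication: {sim_02:.3f}")
```

Because "patient" occurs in every description, its weight is zero and it contributes nothing to the similarity, whereas the rarer shared terms "fell" and "bed" pull the two fall descriptions together. This is the sense in which term weighting reflects the differing importance of words to a description's meaning.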