The goal of this paper is to classify Medical Records (MRs) by their diagnostic terms (DTs) according to the International Classification of Diseases, 9th Revision, Clinical Modification (ICD-9-CM). The challenge is twofold: (i) handling the natural, non-standard language in which doctors express their diagnoses and (ii) solving a large-scale classification problem. We propose the use of Finite-State Transducers (FSTs), which, by virtue of their underlying topology, constrain the allowed input DT string while synchronously producing the output ICD-9-CM class. Their versatility is noteworthy: they efficiently implement soft-matching operations that map terms expressed in natural language to standard terms and, hence, to the final ICD-9-CM code. The FSTs were built from a corpus and from standard resources such as the ICD-9-CM and SNOMED CT, amongst others. The corpus comprises more than 20,000 DTs from MRs of the Basque Hospital System and serves to model natural language in this domain. An F1-measure of 91.2 was achieved on a test set of 2850 randomly selected DTs, and a random 5-fold cross-validation on the training set served to double-check the stability of the reported results. Real MRs were of great help in adapting the system to natural language. Misspellings, colloquial and domain-specific language, and abbreviations made the classification process difficult. The FSTs proved efficient in this large-scale classification task. Moreover, the composition operation for FSTs made it easy to add new features to the system.
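To make the soft-matching idea concrete, the following is a minimal sketch in Python, not the system described in the paper. It assumes a character-level trie as a stand-in for the transducer topology, with the ICD-9-CM code emitted at final states, and it assumes Levenshtein distance as the soft-matching criterion. The class `TrieTransducer`, the helper `normalize`, the `max_edits` parameter, and the two term/code pairs are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List, Optional, Tuple


def normalize(text: str) -> str:
    """Hypothetical normalization step: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


class TrieTransducer:
    """Toy transducer: character transitions, a code emitted at final states."""

    def __init__(self) -> None:
        self.transitions: List[Dict[str, int]] = [{}]  # state -> {char: next state}
        self.output: Dict[int, str] = {}               # final state -> emitted code

    def add(self, term: str, code: str) -> None:
        """Add one standard term and the code it should produce."""
        state = 0
        for ch in normalize(term):
            nxt = self.transitions[state].get(ch)
            if nxt is None:
                nxt = len(self.transitions)
                self.transitions.append({})
                self.transitions[state][ch] = nxt
            state = nxt
        self.output[state] = code

    def classify(self, text: str, max_edits: int = 1) -> Optional[Tuple[str, int]]:
        """Return (code, distance) for the closest term within max_edits, else None."""
        text = normalize(text)
        best: Optional[Tuple[str, int]] = None
        # row[i] = edit distance between the path to this state and text[:i]
        stack = [(0, list(range(len(text) + 1)))]
        while stack:
            state, row = stack.pop()
            if state in self.output and row[-1] <= max_edits:
                if best is None or row[-1] < best[1]:
                    best = (self.output[state], row[-1])
            for ch, nxt in self.transitions[state].items():
                new_row = [row[0] + 1]
                for i in range(1, len(text) + 1):
                    cost = 0 if text[i - 1] == ch else 1
                    new_row.append(min(new_row[i - 1] + 1,   # extra char in input
                                       row[i] + 1,           # missing char in input
                                       row[i - 1] + cost))   # match or substitution
                # Prune paths that can no longer stay within the edit budget.
                if min(new_row) <= max_edits:
                    stack.append((nxt, new_row))
        return best


# Illustrative lexicon only; not taken from the paper's resources.
fst = TrieTransducer()
fst.add("essential hypertension", "401.9")
fst.add("pneumonia", "486")

exact = fst.classify("Pneumonia")
misspelled = fst.classify("pnemonia")
unknown = fst.classify("fracture")
```

In this sketch an exact match yields distance 0, a single misspelling is still mapped to its code at distance 1, and a term with no close entry yields `None`. A real transducer toolkit would express the edit model and the lexicon as separate machines and combine them by composition, which is the property the abstract credits for making new features easy to add.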