Abstract

Objective: Medical coding is used to identify and standardize clinical concepts in the records collected from healthcare services. The tenth revision of the International Classification of Diseases (ICD-10) is the most widely used coding system, with more than 11,000 different diagnoses, and it affects research, reporting, and funding. Unfortunately, ICD-10 code sets tend to follow biased, unbalanced, and scattered distributions. These distribution attributes, along with high lexical variability, severely restrict performance when coded clinical records are used to infer code sets in uncoded records. To improve that inference, we explore a combination of example-based methods optimized to capture codes with different appearance frequencies in data sets.
Materials and Methods: The exploration was carried out on Spanish hospital discharge reports coded by experts, excluding all sentences without any biomedical concept. Representations based on semantic and lexical features are explored, using both global and label-specific attributes. In turn, algorithms based on binary outputs, groups of subsets, and extreme classification are compared. Each method suggests a list of codes together with their confidence values (certainty probabilities).
Results: The methods show diverse spectral behaviors. Binary classifiers seem to maximize the capture of more popular codes, while extreme classifiers promote infrequent ones. To exploit these differences, ensemble approaches are proposed that weight every output code according to the method, confidence value, and appearance frequency. The rule-based combination reaches a 46% Precision at 10 (P@10), which means a 15% improvement over the best individual proposal.
Conclusion: Assembling methods based on weighting each code according to training frequency and performance can achieve better overall Precision scores on extreme distributions, such as ICD-10 coding.
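
The weighting idea described in the Results can be illustrated with a minimal sketch. This is not the authors' rule-based combination: the function name `combine_suggestions`, the two frequency bands, the `rare_threshold` cutoff, and all weights and codes below are hypothetical values chosen only to show how method, confidence value, and training frequency could jointly determine a code's score.

```python
from collections import defaultdict


def combine_suggestions(outputs, method_weights, train_freq,
                        rare_threshold=10, top_k=10):
    """Merge per-method code suggestions into one ranked list.

    outputs:        {method: [(code, confidence), ...]}
    method_weights: {method: {"frequent": w, "rare": w}}  (hypothetical weights)
    train_freq:     {code: number of training occurrences}
    """
    scores = defaultdict(float)
    for method, suggestions in outputs.items():
        for code, confidence in suggestions:
            # Assumption: a code is "rare" if seen fewer than rare_threshold times.
            band = "rare" if train_freq.get(code, 0) < rare_threshold else "frequent"
            # Each vote is the method's weight for that band times its confidence.
            scores[code] += method_weights[method][band] * confidence
    # Highest score first; ties broken alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [code for code, _ in ranked[:top_k]]


# Illustrative inputs only: a binary classifier trusted on frequent codes,
# an extreme classifier trusted on rare ones.
outputs = {
    "binary": [("I10", 0.9), ("E11.9", 0.6)],
    "extreme": [("Q87.1", 0.7), ("I10", 0.4)],
}
method_weights = {
    "binary": {"frequent": 1.0, "rare": 0.3},
    "extreme": {"frequent": 0.5, "rare": 1.0},
}
train_freq = {"I10": 500, "E11.9": 200, "Q87.1": 2}

combined = combine_suggestions(outputs, method_weights, train_freq)
```

In this toy setup the rare code suggested only by the extreme classifier outranks a frequent code suggested only by the binary classifier, which is the kind of complementarity the abstract reports between the two families of methods.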

Highlights

  • Most information coming from healthcare services remains unstructured, preventing direct and easy interpretation of clinical data

  • All S@K values are higher than P@K values, indicating that some of the incorrect suggested codes belong to the same hierarchical branch as some of the unpredicted codes in the report

  • A Precision at 10 (P@10) of 14% and an S@10 of 23% mean that one of the 10 codes recommended by the baseline usually matches completely, and several of the other 9 usually match partially, with the combined percentage of coincidence never exceeding 100%
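
For readers unfamiliar with the metric, P@K can be sketched as the fraction of the top K suggested codes that appear in the expert-assigned code set. The snippet below uses that standard definition; the function name and the example codes are illustrative assumptions, not taken from the paper, and S@K is not reproduced here because the excerpt does not define it.

```python
def precision_at_k(suggested, gold, k=10):
    """Fraction of the top-k suggested codes that are in the gold code set."""
    if k <= 0:
        raise ValueError("k must be positive")
    top_k = suggested[:k]
    hits = sum(1 for code in top_k if code in gold)
    # Standard P@K divides by k, even if fewer than k codes were suggested.
    return hits / k


# Illustrative report: 10 ranked suggestions, only one of which is correct.
suggested_codes = ["I10", "E11.9", "J18.9", "N39.0", "K21.9",
                   "F32.9", "M54.5", "R07.9", "D64.9", "E78.5"]
gold_codes = {"I10", "I50.9", "N18.3"}

p_at_10 = precision_at_k(suggested_codes, gold_codes, k=10)
```

Here a single exact match among 10 suggestions gives a P@10 of 10% for this one report; the figures quoted in the highlights are averages of this kind over the evaluated reports.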


Summary

Introduction

Most information coming from healthcare services remains unstructured, preventing direct and easy interpretation of clinical data. ICD is a clinical cataloging system that enables statistical analyses of morbidity and mortality by defining more than 11,000 diseases, abnormal findings, complaints, social circumstances, external causes of injury, signs, and symptoms. The tenth revision (ICD-10) is one of the main blocks in the clinical information analysis workflow as it is increasingly. ICD-10 is structured in chapters that group codes of 3 and 4 characters in length. The Spanish version (CIE-10-ES) extends the specificity of the hierarchical structure with codes of up to 7 characters, increasing the total to approximately 69,000 diagnoses and 72,000 procedures (note that ICD-10 itself does not contain procedures). Final CIE-10-ES codes can consist of 3 to 7 characters, depending on the specificity of the diagnosis or procedure. More general and shorter codes are assigned when there is a lack of information and longer ones are given in association
