In the medical domain, it is imperative that the protection of an individual's information is not overlooked. The medical research community encourages the use of real-time datasets for research purposes. These datasets contain structured and unstructured (natural language free text) information that can be useful to researchers in various disciplines, including computational linguistics. However, they cannot be distributed without anonymization of Protected Health Information (PHI), because disclosing PHI that can identify an individual (such as name, age, or address) is unethical. We therefore present a rule-based Natural Language Processing (NLP) anonymization system, using a challenging corpus containing medical narratives and ICD-10 codes (medical codes). This anonymization module can be used to pre-process a corpus that contains identifiable information. The corpus used in this research contains 2534 PHIs across 1984 medical records. 15% of the labelled corpus was used to improve the guidelines for identifying and classifying PHI groups, and 85% was held out for evaluation. Our anonymization system follows a two-step process: (1) identification and cataloging of PHIs into four PHI categories ('Patients Name', 'Doctors Name', 'Other Name' [names other than patients and doctors], 'Place Name'), and (2) anonymization, in which each identified PHI is replaced with its respective PHI category. Our method uses basic language processing, dictionaries, rules and heuristics to identify, classify and anonymize PHIs. Evaluated with standard metrics against a human-annotated gold standard, our system achieves an F-measure of 100%, an increase of 39% over the baseline results, which demonstrates the reliability of the anonymized data for research use.
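The two-step process described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionaries, the title-based rules (`Dr.` for doctors, `Mr.`/`Mrs.`/`Ms.` for patients), the bracketed replacement format, and the function names `identify` and `anonymize` are all hypothetical assumptions chosen only to show how dictionaries and ordered rules can feed a category-replacement step.

```python
import re

# Hypothetical dictionaries; the abstract does not list the actual word lists.
PLACE_DICT = {"Springfield", "Riverside"}
NAME_DICT = {"Alice", "Bob", "Carol", "Smith", "Jones"}

# One or more capitalized words, e.g. "John Smith".
NAME = r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*"

# Ordered rules (assumed heuristics): a title decides the category of the
# name that follows it. Group 1 captures only the name, not the title.
RULES = [
    ("Doctors Name", re.compile(rf"\bDr\.\s+({NAME})")),
    ("Patients Name", re.compile(rf"\b(?:Mr|Mrs|Ms)\.\s+({NAME})")),
]


def identify(text):
    """Step 1: return sorted (start, end, category) spans for detected PHIs."""
    spans = []

    def overlaps(start, end):
        return any(start < e and s < end for s, e, _ in spans)

    # Rule pass: earlier rules take priority over later ones.
    for category, pattern in RULES:
        for match in pattern.finditer(text):
            start, end = match.span(1)
            if not overlaps(start, end):
                spans.append((start, end, category))

    # Dictionary pass: capitalized words not already claimed by a rule.
    for match in re.finditer(r"\b[A-Z][a-z]+\b", text):
        start, end = match.span()
        if overlaps(start, end):
            continue
        word = match.group()
        if word in PLACE_DICT:
            spans.append((start, end, "Place Name"))
        elif word in NAME_DICT:
            spans.append((start, end, "Other Name"))

    return sorted(spans)


def anonymize(text):
    """Step 2: replace each identified PHI with its category label."""
    result = text
    # Replace right to left so earlier offsets stay valid.
    for start, end, category in reversed(identify(text)):
        result = result[:start] + f"[{category}]" + result[end:]
    return result


if __name__ == "__main__":
    record = ("Mr. John Smith was seen by Dr. Alice Jones "
              "in Springfield with his brother Bob.")
    print(anonymize(record))
```

Keeping identification separate from replacement means the detected spans can be compared directly against a human-annotated gold standard before any text is altered.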