Extraction of structured data from unstructured medical records using text data mining technologies: process automation

I V Moskalev, O S Krotova, L A Khvorova, D G Bobkova

doi:10.1088/1742-6596/1615/1/012031

Open Access

Abstract

The paper discusses technologies for processing text-based medical data stored in Microsoft Word format. The goal of processing is to mine the text for new, potentially useful knowledge that can later be used to study various diseases and to develop a personalized approach to diagnosis and treatment. In the study, 3244 depersonalized medical records of children and adolescents with diabetes mellitus in Altai Krai were processed. The records store information in both structured and unstructured form. Most of the valuable data, such as the dynamics of the disease course, patient complaints, and the patient’s life history, are recorded in natural language. Processing textual medical records is difficult because of the large number of abbreviations, synonyms, and misprints, which makes it impossible to use a unified template. The study therefore aims to minimize information loss during knowledge extraction by applying various text data mining methods. The practical outcome is a database containing a large amount of valuable information on diabetes mellitus, the various types of its clinical course, and its complications. The obtained data will be used to build mining models for diagnosing and predicting the disease and its complications. To reach the research goal, we used the PostgreSQL DBMS and modern linguistically oriented software built with the Python programming language and its libraries: python-docx, natasha, and Natural Language Toolkit (NLTK).
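
To illustrate why abbreviations, synonyms, and misprints rule out a single fixed template, the following is a minimal sketch of one possible normalization step. It is not the authors' implementation. It uses only the Python standard library (difflib) rather than python-docx, natasha, or NLTK, and the lookup table `VARIANTS`, the functions `normalize_token` and `extract_findings`, the English sample terms, and the similarity cutoff are all hypothetical, chosen purely for illustration.

```python
import re
from difflib import get_close_matches

# Hypothetical lookup table: surface form (abbreviation or synonym)
# mapped to one canonical value that could be stored in a database column.
VARIANTS = {
    "dm": "diabetes mellitus",
    "diabetes": "diabetes mellitus",
    "dka": "ketoacidosis",
    "ketoacidosis": "ketoacidosis",
    "polyuria": "polyuria",
    "thirst": "polydipsia",
    "polydipsia": "polydipsia",
}


def normalize_token(token, cutoff=0.85):
    """Return the canonical term for a token, or None if it is not recognized."""
    token = token.lower()
    if token in VARIANTS:
        return VARIANTS[token]
    # Fuzzy matching tolerates misprints; it is limited to longer tokens
    # so that short common words are not mistaken for abbreviations.
    if len(token) >= 5:
        close = get_close_matches(token, list(VARIANTS), n=1, cutoff=cutoff)
        if close:
            return VARIANTS[close[0]]
    return None


def extract_findings(text):
    """Collect canonical terms found in free text, in order of first appearance."""
    findings = []
    for token in re.findall(r"[A-Za-z]+", text):
        canonical = normalize_token(token)
        if canonical is not None and canonical not in findings:
            findings.append(canonical)
    return findings


if __name__ == "__main__":
    note = "Complaints: polyurea and thirst. DM type 1, DKA in history."
    print(extract_findings(note))
```

In this sketch the misprint "polyurea" is resolved by fuzzy matching, while "DM" and "thirst" are resolved through the abbreviation and synonym entries. A real pipeline over Russian-language records would need morphological analysis and a far larger vocabulary.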
