Fundamental Principles of Standardization and Systematization of Information about Datasets for Machine Learning in Medical Diagnostics

Y A Vasilev, K M Arzamasov, S F Chetverikov, O V Omelyanskaya, N A Pavlov, L N Anishchenko, A V Vladzymyrskyy, T M Bobrovskaya, A E Andreychenko

doi:10.21045/1811-0185-2023-4-28-41

Abstract

Background: The active implementation of artificial intelligence technologies in healthcare in recent years has increased the amount of medical data available for developing machine learning models, including radiology and instrumental diagnostics data. To solve various problems of digital medical technologies, new datasets for machine learning algorithms are being created; the problems of their systematization, standardization, storage, access, and rational and safe use have therefore become pressing.

Aim: To develop an approach to the systematization and standardization of information about datasets in order to represent, store, apply, and optimize the use of datasets and to ensure the safety and transparency of the development and testing of medical devices that use artificial intelligence.

Materials and methods: Analysis of the authors' own and international experience in creating and using medical datasets; search and analysis of medical reference books; development and justification of the registry structure; and a search for scientific publications indexed in the RSCI, Scopus, and Web of Science databases using the keywords “datasets” and “registry of medical data”.

Results: The structure of a registry of medical instrumental diagnostics datasets has been developed in accordance with the stages of the dataset lifecycle: 7 parameters at the initiation stage, 8 at the planning stage, 70 in the dataset card, 1 for version change, and 14 at the use stage, for a total of 100 parameters. We propose a classification of datasets according to the purpose of their creation, a classification of data verification methods, and principles for forming dataset names that standardize and clarify how datasets are presented. In addition, the main aspects of organizing the maintenance of this registry are highlighted: management, data quality, confidentiality, and security.

Conclusions: For the first time, an original technology for structuring and systematizing medical datasets for instrumental diagnostics is proposed. It is based on the developed terminology and principles of information classification. This makes it possible to standardize the structure of information about datasets for machine learning and to centralize its storage. It also provides quick access to all information about a dataset and ensures the transparency, reliability, and reproducibility of artificial intelligence developments. Creating a registry makes it possible to quickly form visual data libraries, which allows a wide range of researchers, developers, and companies to choose datasets for their tasks. This approach ensures the widespread use of datasets, optimizes resources, and contributes to the rapid development and implementation of artificial intelligence.
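
As an illustration only, the parameter counts reported for each lifecycle stage can be tallied in a short Python sketch. The class and variable names (`LifecycleStage`, `REGISTRY_STAGES`, `total_parameters`) are hypothetical and do not come from the registry itself; only the stage names and counts are taken from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LifecycleStage:
    """One stage of the dataset lifecycle (hypothetical representation)."""
    name: str
    parameter_count: int


# Stage names and counts as reported in the abstract; the layout is an assumption.
REGISTRY_STAGES = (
    LifecycleStage("initiation", 7),
    LifecycleStage("planning", 8),
    LifecycleStage("dataset card", 70),
    LifecycleStage("version change", 1),
    LifecycleStage("use", 14),
)


def total_parameters(stages):
    """Sum the number of registry parameters across all lifecycle stages."""
    return sum(stage.parameter_count for stage in stages)


if __name__ == "__main__":
    for stage in REGISTRY_STAGES:
        print(f"{stage.name}: {stage.parameter_count}")
    print("total:", total_parameters(REGISTRY_STAGES))
```

The per-stage counts sum to the 100 parameters stated in the abstract; the actual parameter names and types are defined in the full text and are not reproduced here.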

Full Text