Background Idiopathic cytopenia of undetermined significance (ICUS) is characterized by unexplained, persistent cytopenia that cannot be attributed to a definite hematologic or non-hematologic condition and does not meet the criteria for a myeloid neoplasm. Progression of ICUS to myeloid malignancies has been associated with certain features, such as genetic mutations and their allele frequencies. However, the reported incidence of progression and its risk factors vary across previous studies. While prognostic scoring systems are well established in myelodysplastic syndrome (MDS) for predicting progression to acute myeloid leukemia and survival outcomes, comparable predictive tools for ICUS are currently lacking. Because a significant proportion of patients with ICUS share genetic abnormalities and clinical features with patients with MDS, such tools are needed for ICUS. In this study, we aimed to establish a predictive model for disease progression from ICUS to myeloid malignancies. Methods Patients with ICUS were identified using electronic medical records, and those meeting the diagnostic criteria for ICUS were included. ICUS was defined as persistent cytopenia lasting at least 4 months with no underlying disease or condition that could explain it. Conditions considered to explain the cytopenia included a history of cytotoxic chemotherapy or radiation therapy, blood or bone marrow (BM) disorders, autoimmune diseases, solid organ transplantation, active infection by atypical pathogens, and the use of immunosuppressive or chemotherapeutic agents. Cytopenia was defined as hemoglobin <13 g/dL in males or <12 g/dL in females, neutrophil count <1.9 × 10⁹/L, and/or platelets <150 × 10⁹/L. Candidate features for the predictive model included medication history, laboratory data, comorbidities, family history of cancer and major disorders, smoking history, alcohol consumption, and physical measurements.
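The cytopenia thresholds stated in the Methods can be written as a small check. This is a minimal sketch in Python, not the study's actual pipeline: the `BloodCount` type, its field names, and the `cytopenic_lineages` function are illustrative assumptions. The 4-month persistence requirement and the exclusion criteria are not modeled here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BloodCount:
    """One complete blood count result (hypothetical structure)."""
    sex: str                   # "M" or "F"
    hemoglobin_g_dl: float     # hemoglobin in g/dL
    neutrophils_10e9_l: float  # neutrophil count in 10^9/L
    platelets_10e9_l: float    # platelet count in 10^9/L


def cytopenic_lineages(cbc: BloodCount) -> List[str]:
    """Return the lineages that fall below the thresholds given in the Methods."""
    sex = cbc.sex.upper()
    if sex not in ("M", "F"):
        raise ValueError("sex must be 'M' or 'F'")

    # Hemoglobin cutoff depends on sex: <13 g/dL in males, <12 g/dL in females.
    hemoglobin_cutoff = 13.0 if sex == "M" else 12.0

    lineages = []
    if cbc.hemoglobin_g_dl < hemoglobin_cutoff:
        lineages.append("hemoglobin")
    if cbc.neutrophils_10e9_l < 1.9:
        lineages.append("neutrophils")
    if cbc.platelets_10e9_l < 150.0:
        lineages.append("platelets")
    return lineages


if __name__ == "__main__":
    sample = BloodCount(sex="F", hemoglobin_g_dl=11.5,
                        neutrophils_10e9_l=2.5, platelets_10e9_l=200.0)
    print(cytopenic_lineages(sample))
```

Because the definition uses "and/or", a patient counts as cytopenic when the returned list is non-empty; all comparisons are strict, so a value exactly at a cutoff is not flagged.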
Numeric data from the BM report were standardized using regular expressions, while free text and comments from the BM report and chromosomal analysis were extracted and converted to embeddings using PubMedBERT, a pre-trained language model. The primary objective was to develop a predictive model for identifying progression from ICUS to myeloid malignancies. Model performance was evaluated across three machine learning algorithms: XGBoost, support vector machine, and logistic regression. Results Of the 6,962 patients with records of a BM examination between January 2000 and December 2021, 5,147 were excluded from modeling because of secondary causes of cytopenia, leaving 1,815 patients in the study. Of these, 47 (2.6%) were diagnosed with myeloid malignancies on subsequent BM examination. The predictive model using XGBoost with standardized text and embedded features performed favorably, with a mean area under the receiver operating characteristic curve of 0.800 (Figure). Feature importance assessed with Shapley values indicated that the XGBoost model made effective use of features derived from text data. Conclusion We established a predictive model for disease progression from ICUS to myeloid malignancies, demonstrating the potential utility of clinical predictive tools. Integrating genetic data may further improve the model's performance and provide clinical insights for managing patients with ICUS.
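The Methods step of standardizing numeric BM report data with regular expressions could look roughly like the sketch below. The abstract does not describe the report layout or the fields extracted, so the field labels (`cellularity`, `blast`), the output keys, and the `standardize_bm_numbers` function are hypothetical and serve only to illustrate the technique.

```python
import re
from typing import Dict

# Hypothetical field labels; real BM reports will use different wording.
# Matches e.g. "Cellularity: 40 %", "Blasts 2.5%", "blast=3%".
FIELD_PATTERN = re.compile(
    r"(?P<name>cellularity|blasts?)\s*[:=]?\s*(?P<value>\d+(?:\.\d+)?)\s*%",
    re.IGNORECASE,
)


def standardize_bm_numbers(report_text: str) -> Dict[str, float]:
    """Extract percentage fields from free-form report text as floats."""
    values: Dict[str, float] = {}
    for match in FIELD_PATTERN.finditer(report_text):
        raw_name = match.group("name").lower()
        # Map spelling variants ("blast", "blasts") onto one canonical key.
        key = "blast_pct" if raw_name.startswith("blast") else "cellularity_pct"
        # Keep the first occurrence of each field and ignore later repeats.
        values.setdefault(key, float(match.group("value")))
    return values


if __name__ == "__main__":
    text = "Cellularity: 40 %. Blasts 2.5% of nucleated cells."
    print(standardize_bm_numbers(text))
```

Mapping spelling variants onto a single key is what makes the output usable as a fixed-width feature vector; fields that do not appear in a report are simply absent from the dictionary and would need an explicit missing-value policy downstream.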