Introduction: Flow cytometry performs multi-parameter analysis of cells, measuring surface and intracellular markers for accurate phenotypic characterization of a cell population. It is used extensively in the diagnosis and classification of various hematologic neoplasms. However, analysis of the generated data is time-consuming and remains subjective, requiring special skill and experience. Furthermore, some diagnostic classes, such as myeloproliferative neoplasms (MPN) and myelodysplastic syndrome (MDS), are difficult to diagnose using flow cytometry. The RNA levels of the CD markers used in flow cytometry can be reliably quantified using next-generation sequencing (NGS). When all cells are sequenced jointly, the ability to study cell subpopulations is lost, which hinders accurate diagnosis. Machine learning algorithms, however, are capable of multi-marker normalization and may compensate for the loss of subclonal analysis. To test this assumption, we explored the potential of using the RNA levels of 30 CD markers along with a machine learning algorithm in the differential diagnosis of various types of hematologic neoplasms.

Methods: RNA was extracted from fresh bone marrow and peripheral blood samples from 172 acute myeloid leukemia (AML), 369 normal control, 68 MPN, 218 MDS, 93 acute lymphoblastic leukemia (ALL), 74 chronic lymphocytic leukemia (CLL), 38 mantle cell lymphoma, and 83 multiple myeloma cases. The samples were consecutive and collected without selection. RNA sequencing was performed using a targeted hybrid capture panel that included the CD1A, CD2, CD3D, CD3E, CD3G, CD4, CD5, CD7, CD8A, CD8B, CD10, CD14, CD19, CD20, CD22, CD33, CD34, CD38, CD40, CD44, CD47, CD68, CD70, CD74, CD79A, CD79B, CD81, CD138, CD200, and CD274 genes. Expression was quantified as transcripts per million (TPM) using Salmon v1.4.0. A random forest machine learning algorithm was used to classify diseases.
Two thirds of the samples were used for training the random forest algorithm and one third was used for testing.

Results: Although a diagnosis can frequently be made by simply inspecting the RNA levels of various CD markers, machine learning is needed when the fraction of neoplastic cells is low. Using machine learning (random forest), diagnosis of most hematologic neoplasms was achieved with high sensitivity and specificity in the testing set. The area under the curve (AUC) was 0.972 (95% CI: 0.950-0.994) for AML vs. normal, 0.936 (95% CI: 0.898-0.974) for normal vs. multiple myeloma, 0.965 (95% CI: 0.909-1.00) for mantle cell lymphoma vs. CLL, 0.962 (95% CI: 0.907-1.00) for CLL vs. ALL, 0.935 (95% CI: 0.866-1.00) for CLL vs. normal, and 0.964 (95% CI: 0.927-1.00) for AML vs. ALL. Diseases that are difficult to diagnose by routine flow cytometry were diagnosed by RNA expression and machine learning with acceptable accuracy. For example, the AUC was 0.761 (95% CI: 0.689-0.834) for MDS vs. normal, 0.831 (95% CI: 0.762-0.901) for MDS vs. AML, 0.888 (95% CI: 0.822-0.954) for MDS vs. MPN, and 0.785 (95% CI: 0.698-0.872) for MPN vs. normal.

Conclusions: These data demonstrate that NGS quantification of RNA from 30 CD markers, when combined with machine learning, is adequate for reliable diagnosis of various types of hematologic neoplasms. This approach can provide valuable information for distinguishing between MPN, MDS, and normal bone marrow that flow cytometry cannot provide. Furthermore, this technology can be automated, is less susceptible to human error, and can in practice be used as a replacement for routine flow cytometry analysis.
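
The evaluation scheme described above (a two-thirds/one-third split and a pairwise AUC with a 95% CI) can be illustrated with a minimal Python sketch. It is not the authors' code. The abstract does not say whether the split was stratified or how the confidence intervals were derived, so the per-class stratification, the percentile bootstrap, and all function names here (`split_two_thirds`, `auc`, `bootstrap_ci`) are assumptions made for illustration. The scores stand in for the output of any classifier, such as the class probability from a random forest.

```python
import random

def split_two_thirds(labels, seed=0):
    """Split sample indices so about two thirds of each class go to training.

    Stratifying by class is an assumption; the abstract only gives the ratio.
    """
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        cut = round(len(idx) * 2 / 3)
        train.extend(idx[:cut])
        test.extend(idx[cut:])
    return sorted(train), sorted(test)

def auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def bootstrap_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the AUC (an assumed method, see lead-in)."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    while len(stats) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if len(set(ys)) < 2:
            continue  # resample lacked one class; AUC is undefined, redraw
        stats.append(auc(ys, [scores[i] for i in idx]))
    stats.sort()
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper

if __name__ == "__main__":
    # Cohort sizes for one pairwise comparison taken from the Methods.
    labels = ["AML"] * 172 + ["normal"] * 369
    train_idx, test_idx = split_two_thirds(labels)
    print(len(train_idx), len(test_idx))

    # Toy scores (hypothetical, not study data): 8 of 9 pairs ranked correctly.
    y = [1, 1, 1, 0, 0, 0]
    s = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
    print(auc(y, s), bootstrap_ci(y, s))
```

In a real analysis the scores would be the predicted probabilities for the held-out third only, so the reported AUC reflects samples the model did not see during training.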