Introduction: Blood count analysers are routinely used to assess health status and to monitor disease. The major analysers generate more than 300 red cell, leukocyte and platelet parameters. In clinical practice, however, only 10-20 are utilised; laboratories review a greater number, but some of these serve merely to ensure instrument stability and quality assurance. Although hundreds more measurable parameters are available through extended analytical channels, they are not assessed because of the complexity of the numerical analysis and questions about their clinical utility. Nevertheless, these extended analytical parameters have the potential to generate real-time ‘3-Dimensional (3-D)’ details of blood cells. Aim: In this study we assessed 312 parameters generated from 5,800 patient samples (of all ages and genders) at a tertiary care hospital (National Institute of Blood Disease, Karachi, Pakistan). The data were exported anonymously and processed by machine learning (ML). Method: The methodology of the present study was based on the waterfall model. The output data from a haematology analyser (Sysmex XN-1000, Kobe, Japan), in CSV format with a total of 433 columns, were pre-processed using the Pandas and NumPy libraries to remove unnecessary features (such as date, analysis details, rack position, receiving time, alerts and others), and scaling was performed where required. Each sample was labelled with the conclusion reported on its confirmatory tests; the processed data (312 features of the total 433 columns) carried 67 such conclusions as labels. The extracted data were fed to the ML models through Python modules and libraries for training, testing and validation. 
The web application was developed on a modern Python framework (Flask) to automate the workflow and to provide a ‘drag and drop’ option for the CSV file exported from the analyser. Pre-processing (data engineering), machine learning and the prediction view were connected by a set of different tools (mainly JavaScript libraries). The application generated various metrics, including accuracy, precision, recall and the harmonic mean of precision and recall (F1 score), for the predictions (results) made from the CSV file submitted to our system. For each prediction, the entire precision report and a prediction box were added as an optional visualisation on our web panel. Results: Analysis of 1.8 million data points (312 parameters x 5,800 samples) showed promising predictive potential: in a pilot principal component analysis (PCA), the total variance retained was 41.6%, showing that linear combinations of parameters can explain much of the variability. On a heat map, clustering and visualisation supported, respectively, the predictive potential and the signature deviational trends (fingerprints) of these 3-D blood cell features. Examples included separation of myeloid from lymphoid, chronic from acute, bacterial from viral, iron deficiency from vitamin B12/folic acid deficiency, and differentiation of haemoglobinopathies. The patterns of normal, immature and abnormal blood cells, collectively termed cell population data, were well demonstrated by the results of our machine learning models. Of note, we observed an accuracy of 85.6% along with a precision of 91.2% for one of the ML models used (Random Forest Classifier). Conclusion: High-dimensional cell population data derived from a complete blood count present both opportunities and challenges, and can provide a novel patient-specific haematological fingerprint. This extended deviational patterning (fingerprint) can provide interpretive diagnostic data with practical disease-specific patterns. 
This pilot study shows that machine learning applications driven by complete blood count data have great potential to uncover disease-associated patterns which could be applied in practice. They also have the capacity to provide baseline testing which could assist in sequential health monitoring and, potentially, the generation of personal reference ranges.
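
For readers unfamiliar with the evaluation metrics reported above (accuracy, precision, recall and F1 score), the following is a minimal illustrative sketch of how they can be computed for a multi-class labelling problem. It is not the study's code. The function name `classification_metrics`, the example labels, and the use of macro-averaging (an unweighted mean over classes) are all assumptions made for illustration; the abstract does not state which averaging scheme was used.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall and F1.

    Hypothetical helper for illustration only; macro-averaging is an
    assumption, not something stated in the abstract.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp = Counter()  # true positives per label
    fp = Counter()  # false positives per label
    fn = Counter()  # false negatives per label
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1
            fn[truth] += 1

    def safe_div(num, den):
        # Treat an undefined ratio (no predictions or no true cases) as 0.
        return num / den if den else 0.0

    precisions, recalls, f1s = [], [], []
    for label in labels:
        prec = safe_div(tp[label], tp[label] + fp[label])
        rec = safe_div(tp[label], tp[label] + fn[label])
        precisions.append(prec)
        recalls.append(rec)
        # F1 is the harmonic mean of precision and recall.
        f1s.append(safe_div(2 * prec * rec, prec + rec))

    n_labels = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n_labels,
        "recall": sum(recalls) / n_labels,
        "f1": sum(f1s) / n_labels,
    }


# Made-up labels, purely for demonstration (not study data).
y_true = ["iron_deficiency", "iron_deficiency", "b12_deficiency", "normal"]
y_pred = ["iron_deficiency", "b12_deficiency", "b12_deficiency", "normal"]
example = classification_metrics(y_true, y_pred)
print(example)
```

With many diagnostic classes of unequal size, macro-averaged and sample-weighted scores can differ noticeably, so the averaging scheme should be stated when accuracy and precision figures are compared.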