Abstract

Background: Competence in reaching the correct diagnosis from analysis of clinical pathology specimens requires exposure to a large volume and breadth of cases. Images from textbooks are critical resources that supplement actual cases with well-curated information. However, learning from textbooks has several drawbacks. Images from individual textbooks are limited in number and often do not adequately capture the variation within a disease entity. Textbooks also do not usually provide a platform for learners to assess their diagnostic abilities. This burdens the learner with creating flashcards for self-assessment, which is labor-intensive.

Methods: To address these limitations, we developed PathBrowser, a web browser-accessible program that leverages the PyMuPDF Python package (Artifex) to extract images and captions from textbooks in PDF format. Captions are written to the image files using ExifTool (Phil Harvey), making the images searchable by keyword. The program then displays each image while hiding its caption, challenging the user to enter the appropriate diagnosis. To minimize the work users spend preparing flashcard questions and answers, the program can automatically generate quiz questions by using natural language processing tools and a Naïve Bayes (NB) classifier (NLTK and spaCy packages) to identify important terms in the caption as either diagnoses or cytologic features.

Results: As a proof of principle, we used PathBrowser to build an image repository of myeloblasts and common mimickers, including reactive lymphocytes and chronic lymphoid leukemias, for training junior pathology residents to identify blasts in peripheral blood and body fluids. In total, PathBrowser extracted 173 images (63 myeloblasts and 110 non-blasts) from 2 textbook sources (Atlas of Diagnostic Hematology by Salama et al., and Hematopathology by Jaffe et al.). We trained an NB classifier to label images as “myeloblast” or “non-myeloblast” based on keywords in the caption. We manually ascertained that PathBrowser extracted the appropriate caption for 30/30 (100%) images of myeloblasts. Of these, the NB classifier correctly extracted the term “myeloblast” in 28 (93%) images. A survey of pathology residents showed that 89% of respondents agreed that the program is a useful pathology learning tool.

Conclusions: We developed a program that makes reading from textbooks an interactive learning experience. By automating the labor-intensive steps of extracting images and captions and of preparing flashcards with quiz questions from caption terms classified by machine learning tools, the program lets learners focus on recognizing the diagnostic features represented by the images. Finally, our program enables the rapid generation of a repository of images from diverse sources, so that learners can browse a wide spectrum of disease entities and their variants.
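
To illustrate the caption-labeling step described in the Methods and Results, the following is a minimal sketch of a Naïve Bayes classifier that labels a caption as “myeloblast” or “non-myeloblast” from its words. It is written from scratch with the Python standard library and is not the NLTK/spaCy pipeline used in PathBrowser. The class name `CaptionNaiveBayes`, the `tokenize` helper, the add-one smoothing, and the six training captions are all assumptions made up for illustration; none of them is taken from the textbooks or from the program itself.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(caption):
    """Lowercase a caption and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", caption.lower())


class CaptionNaiveBayes:
    """Multinomial Naive Bayes over caption words with add-one smoothing.

    Hypothetical stand-in for the NB classifier mentioned in the abstract.
    """

    def __init__(self):
        self.label_counts = Counter()            # captions seen per label
        self.word_counts = defaultdict(Counter)  # word frequencies per label
        self.vocabulary = set()                  # all words seen in training

    def train(self, labeled_captions):
        for caption, label in labeled_captions:
            self.label_counts[label] += 1
            for word in tokenize(caption):
                self.word_counts[label][word] += 1
                self.vocabulary.add(word)

    def log_scores(self, caption):
        """Return the unnormalized log posterior for each label."""
        total_captions = sum(self.label_counts.values())
        vocab_size = len(self.vocabulary)
        scores = {}
        for label, count in self.label_counts.items():
            score = math.log(count / total_captions)  # log prior
            label_total = sum(self.word_counts[label].values())
            for word in tokenize(caption):
                if word not in self.vocabulary:
                    continue  # skip words never seen during training
                smoothed = self.word_counts[label][word] + 1
                score += math.log(smoothed / (label_total + vocab_size))
            scores[label] = score
        return scores

    def classify(self, caption):
        scores = self.log_scores(caption)
        return max(scores, key=scores.get)


# Invented example captions (assumption: not quoted from any textbook).
TRAINING_CAPTIONS = [
    ("Myeloblast with fine chromatin, prominent nucleoli and an Auer rod", "myeloblast"),
    ("Acute myeloid leukemia: myeloblasts with scant basophilic cytoplasm", "myeloblast"),
    ("Circulating myeloblasts with high nuclear to cytoplasmic ratio", "myeloblast"),
    ("Reactive lymphocyte with abundant pale cytoplasm", "non-myeloblast"),
    ("Chronic lymphocytic leukemia: small mature lymphocytes and smudge cells", "non-myeloblast"),
    ("Atypical lymphocytes with condensed chromatin in viral infection", "non-myeloblast"),
]

model = CaptionNaiveBayes()
model.train(TRAINING_CAPTIONS)
print(model.classify("Myeloblasts with Auer rods in peripheral blood"))
print(model.classify("Reactive lymphocytes in viral infection"))
```

With this toy training set, the first caption is labeled “myeloblast” and the second “non-myeloblast”. A real deployment would presumably train on the extracted textbook captions and add the preprocessing (for example lemmatization) that NLP libraries provide.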