(1) Background: As digital health technology evolves, the role of accurate medical-gloved hand tracking is becoming more important for the assessment and training of practitioners to reduce procedural errors in clinical settings. (2) Method: This study utilized computer vision for hand pose estimation to model skeletal hand movements during in situ aseptic drug compounding procedures. High-definition video cameras recorded hand movements while practitioners wore medical gloves of different colors. Hand poses were manually annotated, and machine learning models were developed and trained using the DeepLabCut interface via an 80/20 training/testing split. (3) Results: The developed model achieved an average root mean square error (RMSE) of 5.89 pixels across the training data set and 10.06 pixels across the test set. When excluding keypoints with a confidence value below 60%, the test set RMSE improved to 7.48 pixels, reflecting high accuracy in hand pose tracking. (4) Conclusions: The developed hand pose estimation model effectively tracks hand movements across both controlled and in situ drug compounding contexts, offering a first-of-its-kind medical glove hand tracking method. This model holds potential for enhancing clinical training and ensuring procedural safety, particularly in tasks requiring high precision such as drug compounding.