Animal hibernation is a hypometabolic state that may inform translational research aimed at improving clinical outcomes for hypoxic patients whose oxygen supply does not match their demand. Bears are an excellent animal model: during hibernation they decrease their body temperature to only 30-35°C and suppress their metabolism to 25% of normal resting levels, while having a physiology more comparable to humans than deep hibernators such as small rodents. Because little is known about their sleep patterns during hibernation, we used biotelemetry to continuously monitor a variety of physiological parameters from 16 captive American black bears, in and out of hibernation, over more than 3500 recording days. We recorded the EEG, EOG and EMG signals that are commonly used to determine wake, REM sleep and NREM sleep vigilance states in conventional animal models (polysomnography). Such a data set is too large to annotate fully by hand, so we compared two automated approaches. We first manually annotated two one-day recordings at body temperature extremes from each of 6 bears during hibernation, when body temperatures were oscillating widely in multiday cycles with intermittent bouts of shivering (Tøien et al. 2011). We also manually scored a one-day, non-hibernating recording at normal (summer) body temperature from each of these bears. Based on this reference data set, we evaluated two automated scoring applications built on different machine learning classifiers: the open-source Somnotate (author Paul Brodersen), which is trained on multiple files, and the proprietary Somnivore (author Giancarlo Allocca), which is trained on a subset of epochs from each individual recording. Somnotate gave the best results during hold-one-out testing when separate models were used for hibernating and non-hibernating data, and then performed comparably in both conditions. Somnivore was by design trained on a small subset (100 epochs of each kind) of each file to be tested.
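The hold-one-out testing mentioned above can be sketched as follows; this is an illustrative outline of the splitting scheme only, with hypothetical recording names, and does not reproduce either application's actual interface.

```python
def hold_one_out_splits(recordings):
    """Hold-one-out evaluation: for each recording, train a model on all
    other recordings and test it on the held-out one.

    Yields (training_set, held_out_recording) pairs, one per recording.
    """
    for i, held_out in enumerate(recordings):
        training_set = recordings[:i] + recordings[i + 1:]
        yield training_set, held_out


# Hypothetical file names for illustration: one split per recording.
splits = list(hold_one_out_splits(["bear1.edf", "bear2.edf", "bear3.edf"]))
```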
Both applications typically achieved F-measures in the 0.90-0.97 range against the manual reference scores. Outliers in the lower 0.72-0.88 range were highly correlated between the two applications, indicating that specific files are more challenging to annotate, whether manually, automatically, or both. We conclude that both applications have accuracies on par with manual scorers when trained on high-quality data. Supported by NIH COBRE under grant number P20GM130443. This is the full abstract presented at the American Physiology Summit 2024 meeting. Physiology was not involved in the peer review process.
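The per-state F-measure used to compare automated against manual scores can be computed epoch by epoch as follows; the state labels and example sequences are illustrative assumptions, not data from the study.

```python
def per_state_f1(reference, predicted, states=("Wake", "NREM", "REM")):
    """Per-state F-measure (F1) between manual reference scores and
    automated scores, computed one-vs-rest over epoch labels."""
    scores = {}
    for s in states:
        tp = sum(r == s and p == s for r, p in zip(reference, predicted))
        fp = sum(r != s and p == s for r, p in zip(reference, predicted))
        fn = sum(r == s and p != s for r, p in zip(reference, predicted))
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the state is absent.
        scores[s] = 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0
    return scores


# Hypothetical epoch-by-epoch labels for illustration only.
ref = ["Wake", "Wake", "NREM", "NREM", "REM", "NREM"]
pred = ["Wake", "NREM", "NREM", "NREM", "REM", "NREM"]
f1 = per_state_f1(ref, pred)
```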