Abstract
Welfare assessment of dairy cows by in-person farm visits provides only a snapshot of welfare and is time-consuming and costly. A possible way to reduce the need for in-person assessments is to exploit sensor data and other routinely collected on-farm records. The aim of this study was to develop an algorithm to classify dairy cow welfare based on sensor data (accelerometer and milk meter) and farm records (e.g. days in milk, lactation number). In total, 318 cows from six commercial farms located in Finland, Italy and Spain (two farms each) were enrolled in a pilot study lasting 135 days. During this time, cows were routinely scored using 14 animal-based measures of good feeding, health and housing based on the Welfare Quality (WQ®) protocol. WQ® measures were evaluated either every 45 days during on-farm visits, or on a daily basis using disease treatments from farm records and the temperature-humidity index to evaluate heat stress. The severity and duration of each welfare measure were evaluated, and the final welfare assessment was obtained by summing the values for each cow on each day and stratifying the result into three classes: good, average and poor welfare. For model building, a machine learning (ML) algorithm based on gradient boosted trees (XGBoost) was applied. Two model versions were tested: (1) a global model tested on an unseen herd, and (2) a herd-specific model tested on an unseen part of the data from the same herd. Version (1) served as an example of model performance on a herd not previously visited by the evaluator, while version (2) resembled a custom-made solution requiring in-person welfare evaluation for model training. Our results indicated that the global model performed poorly, with an average sensitivity and specificity of 0.44 and 0.68, respectively. The herd-specific model performed better, reaching on average 0.64 sensitivity and 0.80 specificity.
The highest classification performance was obtained for cows in poor welfare, followed by cows in good welfare (balanced accuracy of 0.77 and 0.71, respectively), whereas cows in average welfare were most often misclassified. Since the global model had low classification accuracy, using the developed model as a stand-alone system based solely on sensor data is infeasible, and a combination of in-person and sensor-based welfare evaluation would be preferable for reliable welfare assessment. ML-based solutions, even with only fair discriminative ability, have the potential to enhance dairy welfare monitoring.
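For readers less familiar with the metrics reported above, the following is a minimal sketch of how per-class sensitivity, specificity and balanced accuracy can be computed for a three-class welfare label using a one-vs-rest view. It is not the study's code: the function names (`one_vs_rest_metrics`, `macro_average`) and the toy labels are hypothetical, and averaging across the three classes is an assumption, since the abstract does not state how the reported averages were formed.

```python
from typing import Dict, List

# Welfare classes named in the abstract.
CLASSES = ["good", "average", "poor"]

# Hypothetical toy labels for illustration only (not study data).
Y_TRUE = ["good", "good", "average", "average", "poor", "poor"]
Y_PRED = ["good", "average", "good", "average", "poor", "poor"]


def one_vs_rest_metrics(
    y_true: List[str], y_pred: List[str], positive: str
) -> Dict[str, float]:
    """Treat `positive` as the class of interest and all others as negative."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    # Sensitivity: share of true `positive` cases that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of other cases correctly not labelled `positive`.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")

    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        # Balanced accuracy is the mean of sensitivity and specificity.
        "balanced_accuracy": (sensitivity + specificity) / 2,
    }


def macro_average(y_true: List[str], y_pred: List[str]) -> Dict[str, float]:
    """Unweighted mean of each metric over the three classes (an assumption)."""
    per_class = [one_vs_rest_metrics(y_true, y_pred, c) for c in CLASSES]
    return {
        key: sum(m[key] for m in per_class) / len(per_class)
        for key in ("sensitivity", "specificity", "balanced_accuracy")
    }


if __name__ == "__main__":
    for cls in CLASSES:
        print(cls, one_vs_rest_metrics(Y_TRUE, Y_PRED, cls))
    print("macro", macro_average(Y_TRUE, Y_PRED))
```

In this toy example the "poor" class is classified perfectly, while "good" and "average" are confused with each other, which mirrors the qualitative pattern described above without reproducing any of the reported values.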