An electrical submersible pump (ESP) is an important piece of equipment used in industry to lift liquids in various types of wells, and it is widely used in the oil industry for offshore exploration. Detecting a faulty ESP before installation is a predictive maintenance measure that extends its operational time, and machine-learning-based fault diagnosis is an effective way to perform this task. Such algorithms, however, depend heavily on the availability of an appropriate dataset for the problem. This paper describes in detail the problem of ESP fault diagnosis and presents ESPset, a real-world, public dataset for vibration-based fault diagnosis of electrical submersible pumps used in offshore oil exploration. The paper also proposes an experimental framework for fairly comparing research works based on ESPset, and it defines benchmark classifiers with their respective results as a reference for the fault diagnosis research community. The framework accounts for the fact that some subsets of samples are not drawn independently and therefore proposes a cross-validation sampling strategy that mitigates the similarity bias among samples. This work shows that conventional k-fold cross-validation may lead to a clear overestimation of average performance: the mean F-measure of the best classification model drops from 0.887 to 0.733 when the similarity bias is removed from the data.
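
The idea of keeping dependent samples on the same side of every split can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the sampling procedure defined by the framework: the function `group_k_fold`, the `pump_ids` labels, and the premise that dependence between samples can be captured by one group label per sample are all hypothetical and chosen only for the example.

```python
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs so that no group spans both sides.

    `groups` holds one hypothetical group label per sample (for example, an
    identifier of the pump a vibration signal was recorded from).
    """
    # Collect the sample indices that belong to each group.
    by_group = defaultdict(list)
    for idx, g in enumerate(groups):
        by_group[g].append(idx)
    if n_splits < 2 or n_splits > len(by_group):
        raise ValueError("n_splits must be between 2 and the number of groups")

    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest samples.
    folds = [[] for _ in range(n_splits)]
    for g in sorted(by_group, key=lambda g: (-len(by_group[g]), str(g))):
        smallest = min(range(n_splits), key=lambda k: len(folds[k]))
        folds[smallest].extend(by_group[g])

    # Each fold serves once as the test set; the rest form the training set.
    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j in range(n_splits) if j != k for i in folds[j])
        yield train_idx, test_idx


# Toy example (assumed data): 10 samples recorded from 4 hypothetical pumps.
pump_ids = ["A", "A", "A", "B", "B", "C", "C", "C", "C", "D"]
splits = list(group_k_fold(pump_ids, n_splits=2))
for train_idx, test_idx in splits:
    train_pumps = sorted({pump_ids[i] for i in train_idx})
    test_pumps = sorted({pump_ids[i] for i in test_idx})
    print("train pumps:", train_pumps, "| test pumps:", test_pumps)
```

With a plain k-fold split, samples from the same group can fall into both the training and the test set, so a classifier may be rewarded for recognizing near-duplicates rather than for generalizing. Splitting by group removes that shortcut, which is consistent with the lower but more realistic scores reported when the similarity bias is removed.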