The impact of biased sampling of event logs on the performance of process discovery

Mohammadreza Fani Sani, Sebastiaan J Van Zelst, Wil M P Van Der Aalst

doi:10.1007/s00607-021-00910-4

Open Access

https://doi.org/10.1007/s00607-021-00910-4

Abstract

Process discovery algorithms derive process models from event data captured during the execution of business processes. These algorithms tend to use the entire event log. For large event logs, however, running discovery on standard hardware within a limited time is no longer feasible. A straightforward way to overcome this problem is to downsize the data using random sampling. However, little research has been conducted on selecting the right sample given the available time and the characteristics of the event data. This paper systematically evaluates various biased sampling methods on different datasets using four discovery techniques. Our experiments show that biased sampling can speed up discovery techniques considerably without degrading the quality of the resulting process model. Furthermore, because sampling implicitly filters the data (removing outliers), model quality may even improve.
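
To make the contrast between random and biased sampling concrete, here is a minimal Python sketch. It assumes an event log is a list of traces, each trace a list of activity labels, and it uses variant frequency as the bias. That criterion is an illustrative assumption, not necessarily one of the methods evaluated in the paper, and the function names `random_sample` and `frequency_biased_sample` are hypothetical.

```python
import random
from collections import Counter


def random_sample(log, k, seed=0):
    """Unbiased baseline: draw k traces uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(log, k)


def frequency_biased_sample(log, k):
    """Biased sample (assumed criterion): keep one representative trace
    for each of the k most frequent variants (distinct activity sequences)."""
    variants = Counter(tuple(trace) for trace in log)
    return [list(variant) for variant, _ in variants.most_common(k)]


# Toy event log: three variants with frequencies 5, 3 and 1.
log = [["a", "b", "c"]] * 5 + [["a", "c", "b"]] * 3 + [["a", "d"]]

sample = frequency_biased_sample(log, 2)
print(sample)
```

In this toy log the rare variant `["a", "d"]` is excluded from the biased sample, which illustrates the implicit outlier filtering mentioned in the abstract; a uniform random sample gives no such guarantee.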

Full Text