In some domains, duplication may indicate cheating or an adversarial act intended to skew data. On Sony’s PlayStation Network (PSN), which serves the world’s most popular gaming platform, we observe cheating through duplication of user data in the context of trophies/achievements. This domain is representative of the challenges posed by increasingly prevalent temporal data, where conventional similarity- and distance-based deduplication techniques struggle. We adapt the Adaptive Sorted Neighborhood Method (ASNM) to temporal domains by applying ASNM, inferring attribute metadata, and inferring temporal ordering requirements using two subsequence discovery techniques, Longest Common Subsequence (LCS) and Needleman–Wunsch (NW). For records of a shared type, we split each record’s time-ordered events into constrained and unconstrained sequences. Using both a binary classification and a confidence-based approach, we flag as suspicious (errant) any record that does not adhere to the inferred constrained order, and we may flag a record as a duplicate if its unconstrained order matches that of another record. ASNM, [Formula: see text] and [Formula: see text] were evaluated against a labeled dataset of 22,794 records from PSN trophy data, where duplication may be indicative of cheating. [Formula: see text] achieved an F1 of 0.949 using the confidence-based approach, outperforming ASNM and [Formula: see text]. ASNM’s best performance was an F1 of 0.708 at the 0.99 similarity threshold; [Formula: see text]’s best performance was an F1 of 0.942 using the confidence-based approach. This significant performance improvement comes at little cost, as [Formula: see text] and [Formula: see text] averaged only 3.79% and 5.75% additional runtime, respectively.
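The constrained/unconstrained split described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the constrained order for a record type can be approximated by folding a standard LCS over a few trusted reference records, and all event names, record identifiers, and helper functions (`lcs`, `is_subsequence`, `split_events`) are hypothetical. The ASNM blocking step, the NW variant, and the confidence-based scoring are omitted.

```python
from collections import defaultdict
from functools import reduce
from typing import Dict, List, Sequence, Tuple


def lcs(a: Sequence[str], b: Sequence[str]) -> List[str]:
    """Standard dynamic-programming Longest Common Subsequence."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            if a[i] == b[j]:
                dp[i + 1][j + 1] = dp[i][j] + 1
            else:
                dp[i + 1][j + 1] = max(dp[i][j + 1], dp[i + 1][j])
    # Backtrack from the bottom-right corner to recover one LCS.
    out: List[str] = []
    i, j = m, n
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            out.append(a[i - 1])
            i, j = i - 1, j - 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return out[::-1]


def is_subsequence(needle: Sequence[str], haystack: Sequence[str]) -> bool:
    """True if `needle` appears in `haystack` in order (gaps allowed)."""
    it = iter(haystack)
    return all(event in it for event in needle)


def split_events(
    events: Sequence[str], constrained: Sequence[str]
) -> Tuple[List[str], List[str]]:
    """Split a time-ordered record into constrained and unconstrained events."""
    keep = set(constrained)
    return (
        [e for e in events if e in keep],
        [e for e in events if e not in keep],
    )


# Hypothetical trusted records of one type (assumption: known to be legitimate).
trusted = [
    ["intro", "a", "boss1", "boss2", "b", "final"],
    ["intro", "b", "boss1", "boss2", "a", "final"],
]
# Assumption: the order shared by all trusted records is the constrained order.
constrained = reduce(lcs, trusted)

# Hypothetical records to screen.
records: Dict[str, List[str]] = {
    "u1": ["intro", "a", "boss1", "b", "boss2", "final"],
    "u2": ["boss2", "intro", "a", "boss1", "b", "final"],
    "u3": ["intro", "b", "a", "boss1", "boss2", "final"],
}

# Binary check: a record is suspicious if it violates the constrained order.
suspicious = sorted(
    rid for rid, ev in records.items() if not is_subsequence(constrained, ev)
)

# Duplicate candidates: records sharing the same unconstrained order.
by_unconstrained: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
for rid, ev in records.items():
    by_unconstrained[tuple(split_events(ev, constrained)[1])].append(rid)
duplicate_groups = sorted(g for g in by_unconstrained.values() if len(g) > 1)
```

With only two unconstrained events, as here, matching unconstrained orders are expected by chance; the duplicate signal becomes meaningful only when records contain many unconstrained events, so the grouping above should be read as a candidate filter rather than a verdict.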