Negative sampling plays a crucial role in implicit-feedback-based collaborative filtering, where it leverages massive unlabeled data to generate negative signals for guiding supervised learning. Current state-of-the-art approaches focus on hard negative samples, which carry more information and help establish a better decision boundary. To balance efficiency and effectiveness, most existing methods adopt a two-pass approach: in the first pass, a fixed number of unobserved items are sampled from a simple static distribution; in the second pass, a more sophisticated negative sampling strategy selects the final negative items. However, selecting negative samples solely from the original items in a dataset is inherently restricted by the limited choices available, and the selected samples may therefore fail to contrast effectively with positive samples. In this paper, we empirically validate this observation through meticulously designed experiments and identify three major limitations of existing solutions: ambiguous trap, information discrimination, and false negative samples. To address these limitations, we introduce “denoised” and “augmented” negative samples that may not exist in the original dataset. This direction poses substantial technical challenges. First, constructing augmented negative samples may introduce excessive noise that eventually distorts the decision boundary. Second, the scarcity of supervision signals hampers the denoising process. To this end, we introduce a novel generic denoising and augmented negative sampling (DANS) paradigm and provide a concrete instantiation. First, we disentangle the hard and easy factors of negative items. Then, we regulate the augmentation of easy factors by carefully considering its direction and magnitude. Next, we propose a reverse attention mechanism to learn a user’s negative preference, which allows us to perform a dimension-level denoising procedure on hard factors. Finally, we design an advanced negative sampling strategy to identify the final negative samples, taking into account both the score function used in existing methods and a novel metric called synthesization gain. Through extensive experiments on real-world datasets, we demonstrate that our method substantially outperforms state-of-the-art baselines. Our code is publicly available at https://github.com/Asa9aoTK/ANS-Recbole.
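
For readers unfamiliar with the two-pass approach used by existing methods, the following is a minimal sketch and not the DANS method itself. It assumes a uniform distribution for the first pass and, for the second pass, selection of the candidate with the highest inner-product score (one common way to pick a hard negative). The function name `two_pass_negative_sample`, its parameters, and the toy embeddings are hypothetical and chosen only for illustration.

```python
import random


def two_pass_negative_sample(user_vec, item_vecs, observed, pool_size, rng):
    """Return the index of one negative item for a user.

    Pass 1: draw up to `pool_size` unobserved items uniformly at random
            (a simple static distribution).
    Pass 2: keep the candidate with the highest inner-product score,
            i.e. the hardest negative in the pool (an assumed strategy).
    """
    unobserved = [i for i in range(len(item_vecs)) if i not in observed]
    if not unobserved:
        raise ValueError("user has interacted with every item")
    pool = rng.sample(unobserved, min(pool_size, len(unobserved)))

    def score(i):
        # Inner product between the user and item embeddings.
        return sum(u * v for u, v in zip(user_vec, item_vecs[i]))

    return max(pool, key=score)


# Toy data: one user, five items with 2-dimensional embeddings.
rng = random.Random(0)
user = [1.0, 0.0]
items = [[0.9, 0.1], [0.8, 0.0], [0.1, 0.9], [-0.5, 0.2], [0.4, 0.4]]
observed = {0}  # item 0 is the user's positive interaction

# The pool covers all four unobserved items, so the result is deterministic.
negative = two_pass_negative_sample(user, items, observed, pool_size=4, rng=rng)
```

In this sketch the negative is always an item that already exists in the dataset, which is the restriction the abstract argues against.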