Abstract

Semi-supervised deep clustering methods attract much attention due to their excellent performance on the end-to-end clustering task. However, it is hard to obtain satisfying clustering results since many overlapping samples in industrial text datasets strongly and incorrectly influence the learning process. Existing methods incorporate prior knowledge in the form of pairwise constraints or class labels, which not only largely ignore the correlation between these two supervision information but also cause the problem of weak-supervised constraint or incorrect strong-supervised label guidance. In order to tackle these problems, we propose a semi-supervised method based on pairwise constraints and subset allocation (PCSA-DEC). We redefine the similarity-based constraint loss by forcing the similarity of samples in the same class much higher than other samples and design a novel subset allocation loss to precisely learn strong-supervised information contained in labels which consistent with unlabeled data. Experimental results on the two industrial text datasets show that our method can yield 8.2%–8.7% improvement in accuracy and 13.4%–19.8% on normalized mutual information over the state-of-the-art method.

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call

Disclaimer: All third-party content on this website/platform is and will remain the property of their respective owners and is provided on "as is" basis without any warranties, express or implied. Use of third-party content does not indicate any affiliation, sponsorship with or endorsement by them. Any references to third-party content is to identify the corresponding services and shall be considered fair use under The CopyrightLaw.