Abstract

Large-scale multi-label text classification (LMTC) aims at tagging each text with its most relevant subset of labels from a large candidate label set. State-of-the-art LMTC methods are mostly based on deep neural networks trained with a cross-entropy loss. These methods require massive human-labeled training data to capture the underlying label correlations implicitly, which can be a burden in real-world applications. We observe that many correlated labels are missed by human annotators, even though their label names often co-occur in document contexts. In this paper, we propose a novel text classification framework, FLC, incorporating “free” label correlations derived from massive raw text corpora. Specifically, we design a principled similarity measurement that estimates the correlation between labels from the co-occurrence statistics of their label names. Based on the derived label correlations, we add a complementary loss that makes the overall loss more robust when the classifier predicts correlated labels that human annotators missed. Experiments on two real-world datasets demonstrate the superiority of FLC over state-of-the-art methods, especially when the human labels are imperfect and the training data is limited. Moreover, the computational overhead of employing FLC is negligible.
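To make the two ingredients concrete, below is a minimal sketch in Python (PyTorch for the loss). It assumes a positive-PMI co-occurrence score between label names and a simple down-weighting of correlated negative labels; the function names label_similarity and flc_loss, the weighting scheme, and the lam hyperparameter are illustrative assumptions, not the paper's actual formulation.

import math
from collections import Counter
from itertools import combinations

import torch
import torch.nn.functional as F

def label_similarity(docs, label_names):
    # Count, over raw documents, how often each label name appears and
    # how often pairs of label names co-occur in the same document.
    n_docs = len(docs)
    occur, co_occur = Counter(), Counter()
    for doc in docs:
        text = doc.lower()
        present = sorted(name for name in label_names if name.lower() in text)
        occur.update(present)
        co_occur.update(combinations(present, 2))
    # Positive pointwise mutual information (PPMI) as the correlation score:
    # max(0, log(p(a, b) / (p(a) p(b)))).
    sim = {}
    for (a, b), n_ab in co_occur.items():
        pmi = math.log(n_ab * n_docs / (occur[a] * occur[b]))
        sim[(a, b)] = max(pmi, 0.0)
    return sim

def flc_loss(logits, targets, corr, lam=0.1):
    # Per-label binary cross-entropy for multi-label classification.
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    # For each label, the correlation mass it receives from the gold labels
    # (corr is a dense [L, L] matrix built from the scores above).
    support = (targets @ corr).clamp(max=1.0)
    # Down-weight the penalty on negative labels strongly correlated with
    # the gold labels, i.e. plausible annotator misses.
    weight = 1.0 - lam * support * (1.0 - targets)
    return (weight * bce).mean()

In this sketch, lam trades off robustness to missed labels against trusting the given annotations: at lam = 0 the loss reduces to plain binary cross-entropy.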
