Abstract
Social media platforms hold rich text data that can be used for data mining and analysis. However, natural language evolves rapidly on social media, and social media data is very noisy, which poses a great challenge to the accuracy of data analysis. To overcome this problem, we propose topic-relevant content extraction (TRCE), a method based on deep multiple-instance classification. TRCE leverages existing information and the hierarchical relationships among texts under a thread on social media as weak supervision, so that it can extract strongly topic-relevant data and filter out noise accurately without manually labeled data. The proposed method introduces latent variables, the Bernoulli distribution, and variational inference into multiple-instance learning (MIL) to generate pseudo labels. We then employ a dual-stream neural network with a three-stage training process to train MIL end-to-end. Experimental results show that TRCE improves significantly over other MIL methods, while its accuracy and F1 score are only slightly lower than those of supervised text classification. Given that TRCE needs no manually labeled data at all, whereas supervised classification relies heavily on labeled data, TRCE is a competitive method for extracting topic-relevant data and filtering out noise on social media.