Abstract

Microblogging sites such as Twitter continuously generate a large volume of streaming data. This streaming environment creates new challenges for two concomitant Information Extraction tasks: Entity Mention Detection (EMD) and Entity Detection (ED). These challenges include (1) continuously evolving topics, which can quickly render model-based approaches obsolete; (2) the non-literary nature of posts, which makes traditional NLP techniques less effective; and (3) the huge volume of streaming data, which makes computationally expensive approaches less suitable. In this paper, we propose an approach to EMD/ED that is designed from the ground up around the constraints specific to streaming environments. Our system, TwiCS, implements this approach using a computationally light two-phase process. In the first phase, it exploits simple, low-computation syntactic cues to suggest Entity Mention (EM) candidates. In the second phase, it uses occurrence mining to classify candidates according to their likelihood of being true EMs. Our experiments show that TwiCS achieves an average effectiveness improvement of 14.6% while maintaining at least 2.64 times higher throughput than several state-of-the-art systems.
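
The two-phase idea can be illustrated with a minimal sketch. This is not the TwiCS implementation: the abstract does not specify which syntactic cues or which occurrence statistics are used, so the sketch assumes capitalization as the cue and a capitalized-to-total occurrence ratio as the classification score. The names `suggest_candidates`, `classify_candidates`, and `threshold` are hypothetical, and the code processes a batch of posts rather than a live stream.

```python
import re
from collections import Counter

# Assumed syntactic cue (not taken from the paper): runs of capitalized words.
CAP_RUN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")


def suggest_candidates(post):
    """Phase 1 (assumed cue): return runs of capitalized words as EM candidates."""
    return CAP_RUN.findall(post)


def classify_candidates(posts, threshold=0.6):
    """Phase 2 (assumed statistic): keep a candidate when the share of its
    occurrences that are capitalized is at least `threshold`."""
    # Count how often each candidate was suggested by the capitalization cue.
    cap_counts = Counter()
    for post in posts:
        for candidate in suggest_candidates(post):
            cap_counts[candidate.lower()] += 1

    decisions = {}
    for key, capitalized in cap_counts.items():
        # Count all occurrences of the candidate, regardless of case.
        pattern = re.compile(r"\b" + re.escape(key) + r"\b", re.IGNORECASE)
        total = sum(len(pattern.findall(post)) for post in posts)
        decisions[key] = capitalized / total >= threshold
    return decisions


posts = [
    "just landed in Boston for the weekend",
    "boston traffic is awful but Boston food is great",
    "Great day out",
    "what a great day, great food",
]
print(classify_candidates(posts))
```

In this toy batch, "Boston" is capitalized in two of its three occurrences and is kept, while "Great" is capitalized in only one of four and is rejected. A word that appears only once, at the start of a post, would be kept by this simple ratio, which is one reason a real system would need richer cues and statistics.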
