Abstract

In recent years, the need for flexible, low-latency Natural Language Processing (NLP) pipelines has become more pressing. Real-time data sources, such as Twitter, necessitate real-time text analysis platforms. In addition, because NLP toolkits and libraries exist across a variety of programming languages, a streaming platform is needed to combine and integrate modules from different toolkits. This study proposes a real-time architecture that uses Apache Storm and Apache Kafka to apply NLP tasks to streams of textual data. The architecture allows developers to inject NLP modules into it from different programming languages. To evaluate its performance, a series of experiments was conducted running OpenNLP, fastText, and spaCy modules for Bahasa Malaysia and English. The results show that Apache Storm achieved the lowest latency compared with Trident and the baseline experiments.
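The pipeline the abstract describes, messages arriving on a Kafka topic and flowing through a chain of Storm processing stages ("bolts"), can be sketched in-process as follows. This is a minimal illustration, not the paper's implementation: a `queue.Queue` stands in for the Kafka topic, plain functions stand in for Storm bolts, and the `tokenize_bolt` stage is a hypothetical placeholder for a real NLP module such as OpenNLP or spaCy tokenization.

```python
import queue

def tokenize_bolt(text):
    # Placeholder for an injected NLP module (e.g. spaCy/OpenNLP tokenizer).
    return text.split()

def count_bolt(tokens):
    # A second stage consuming the first stage's output.
    return len(tokens)

def run_pipeline(messages, bolts):
    """Push each queued message through the chain of bolts, in order."""
    topic = queue.Queue()          # stands in for a Kafka topic
    for m in messages:
        topic.put(m)
    results = []
    while not topic.empty():
        item = topic.get()
        for bolt in bolts:         # stands in for a Storm topology
            item = bolt(item)
        results.append(item)
    return results

print(run_pipeline(["real-time text analysis", "streams of textual data"],
                   [tokenize_bolt, count_bolt]))
# → [3, 4]
```

In a real deployment the queue would be replaced by a Kafka consumer and the bolt chain by a Storm topology, which is what lets modules written in different languages be plugged into the same stream.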
