Natural Language Processing

Pramod Singh

doi:10.1007/978-1-4842-7777-5_9

Abstract

This final chapter of the book focuses on techniques for handling text data with PySpark. Text data is now generated at a rapid pace, as social media platforms let users share reviews, suggestions, comments, and more. The field concerned with making machines learn from and understand textual data in order to perform useful tasks is known as Natural Language Processing (NLP). Text data can be structured or unstructured, and several preprocessing steps are needed to make it ready for analysis. NLP is already a large area of research with a wide range of applications built on text data, such as chatbots, speech recognition, language translation, recommender systems, spam detection, and sentiment analysis. This chapter walks through a series of steps for processing text data and applying Machine Learning algorithms to it. As a bonus, it also shows how to learn sequence embeddings using word2vec in PySpark.

Full Text