The proliferation of social media platforms has led to a significant increase in the sharing of memes and other image-based content. Some of these images contain inappropriate or offensive text, which is difficult to detect with traditional methods, and existing open-source solutions do not classify the text content of images. This project aims to develop a robust solution for detecting Not Safe for Work (NSFW) text content in images, with a particular focus on social media memes. Its objectives are to accurately extract text from images using optical character recognition (OCR) techniques and to classify the extracted text as either NSFW or Safe for Work (SFW) using a fine-tuned natural language processing (NLP) model. The methodology involves several key steps. First, the Keras OCR library extracts text from the input image; it was chosen because it demonstrated superior performance compared to other OCR tools on social media images. Next, a preprocessing step arranges the extracted words in the correct reading sequence using their coordinates and their distance from the image origin, computed with the Pythagorean theorem. Finally, the preprocessed text is passed to a fine-tuned BERT-BASE-UNCASED language model, trained on a dataset of NSFW and SFW Reddit posts, which classifies the text as either NSFW or SFW.

Key Words: OCR, Inappropriate, BERT, NSFW, SFW, Image, Text
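
The reading-order step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `order_words`, the `row_tolerance` parameter, and the row-grouping heuristic are assumptions made for illustration. The input is assumed to be a list of `(word, box)` pairs, where each box holds four `(x, y)` corner points with the origin at the top-left of the image, similar in shape to what an OCR pipeline typically returns.

```python
import math


def order_words(predictions, row_tolerance=0.5):
    """Arrange OCR words in reading order (hypothetical helper).

    predictions: list of (text, box) pairs; box is four (x, y) corners.
    row_tolerance: assumed fraction of a word's height within which two
        words are treated as lying on the same text row.
    """
    words = []
    for text, box in predictions:
        xs = [point[0] for point in box]
        ys = [point[1] for point in box]
        left, top = min(xs), min(ys)
        height = max(ys) - top
        words.append({
            "text": text,
            "center_y": top + height / 2,
            "height": height,
            # Pythagorean distance of the top-left corner from the origin.
            "distance": math.hypot(left, top),
        })

    # Group words into rows, scanning from the top of the image down.
    words.sort(key=lambda w: w["center_y"])
    rows = []
    for word in words:
        if rows:
            anchor = rows[-1][0]
            limit = row_tolerance * anchor["height"]
            if abs(word["center_y"] - anchor["center_y"]) <= limit:
                rows[-1].append(word)
                continue
        rows.append([word])

    # Within a row, the word closest to the origin is read first.
    ordered = []
    for row in rows:
        row.sort(key=lambda w: w["distance"])
        ordered.extend(w["text"] for w in row)
    return " ".join(ordered)


# Made-up boxes for a two-row meme caption, given in scrambled order.
sample = [
    ("works", [(40, 51), (100, 51), (100, 71), (40, 71)]),
    ("you", [(70, 12), (110, 12), (110, 32), (70, 32)]),
    ("it", [(10, 50), (30, 50), (30, 70), (10, 70)]),
    ("when", [(10, 10), (60, 10), (60, 30), (10, 30)]),
    ("code", [(120, 9), (170, 9), (170, 29), (120, 29)]),
]
print(order_words(sample))  # when you code it works
```

Grouping by row before sorting by distance is a design choice of this sketch: sorting by distance alone would interleave words from different rows whenever a word far to the right on one row is farther from the origin than the first word of the next row.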