Abstract

Great strides have been made in advancing the domain of Natural Language Processing (NLP) over the past few years. The introduction of the attention mechanism in neural networks has led to the creation of very complex and efficient architectures such as BERT and GPT. These architectures are well established as the first choice for most problem statements, but their application to certain simpler use cases generally undermines their real potential. The use case I explore in this paper is Natural Language Processing with Disaster Tweets, a binary classification task for picking out tweets that refer to a disaster. In this paper I report the performance of a term-frequency-based algorithm that does not leverage any vector embedding techniques. Using an open-source dataset from Kaggle, I obtained a leaderboard score of 0.78; the scoring metric used for the leaderboard is the F1 score, and the best score stands at 0.85. The approach leverages the ability of the inverse document frequency (IDF) metric to readily eliminate stop words from a corpus, and uses a modified term frequency metric <put definition reference> to classify the sentences. This term frequency metric has been designed to eliminate any impact of stop words that may have escaped the IDF filter. The paper presents a detailed description of all the algorithms and metrics utilized before arriving at a final conclusion.
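To make the pipeline described above concrete, the sketch below illustrates the general shape of the approach: IDF scores are used to filter out low-information tokens (stop words), and the remaining tokens are scored against per-class term-frequency tables. Note that the paper's modified term frequency metric is not defined in this abstract (its definition reference is still pending), so the plain per-class TF scoring used here is an illustrative stand-in, not the author's actual metric; all function names and the toy corpus are hypothetical.

```python
# Illustrative sketch only: the modified TF metric from the paper is not
# defined in the abstract, so a plain per-class TF score stands in for it.
import math
from collections import Counter

def idf_scores(docs):
    """Compute inverse document frequency for every token in the corpus."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.lower().split()))
    return {tok: math.log(n_docs / df) for tok, df in doc_freq.items()}

def filter_stop_words(doc, idf, threshold=1.0):
    """Drop tokens whose IDF falls below the threshold (likely stop words)."""
    return [t for t in doc.lower().split() if idf.get(t, 0.0) >= threshold]

def class_term_frequencies(docs, labels, idf):
    """Aggregate term frequencies per class over IDF-filtered tokens."""
    tf = {0: Counter(), 1: Counter()}
    for doc, label in zip(docs, labels):
        tf[label].update(filter_stop_words(doc, idf))
    return tf

def classify(doc, tf, idf):
    """Assign the class whose term-frequency table best matches the tweet."""
    tokens = filter_stop_words(doc, idf)
    scores = {c: sum(counts[t] for t in tokens) for c, counts in tf.items()}
    return max(scores, key=scores.get)

# Hypothetical usage on a tiny toy corpus (not the Kaggle dataset):
train_docs = ["forest fire near la ronge", "i love this sunny day",
              "earthquake shakes the city", "the movie was great"]
train_labels = [1, 0, 1, 0]  # 1 = disaster tweet, 0 = not a disaster
idf = idf_scores(train_docs)
tf = class_term_frequencies(train_docs, train_labels, idf)
print(classify("massive fire downtown", tf, idf))  # expected: 1 (disaster)
```

In this toy corpus, a common token like "the" appears in half the documents and receives a low IDF, so it is filtered before scoring, while a distinctive token like "fire" survives and pulls the test tweet toward the disaster class.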
