The rapid digitization of media has accelerated the spread of information through online platforms and, with it, the spread of misinformation. This has heightened the need for robust fact-checking mechanisms, and in particular for claim detection systems that can support automated or semi-automated fact-checking. Existing claim detection systems predominantly focus on English, with limited resources available for other languages such as Bangla. This paper proposes an ensemble machine learning framework for detecting claims in Bangla, a low-resource language; claim detection is a critical first step in the automated fact-checking process. The proposed weighted ensemble technique combines Support Vector Machines, Bernoulli Naive Bayes, and Decision Trees as base classifiers. We developed an annotated text dataset of 5010 sentences sourced from various online platforms, including several online fact-checking sites. To determine the best model and feature representation for claim detection, we evaluated various machine learning algorithms using BoW, TF-IDF, Word2Vec, and FastText features. We also examined ensemble models under both averaging and weighting strategies. The proposed weighted ensemble approach outperformed all baseline models, achieving a maximum F1 score of 0.87. To the best of our knowledge, this study is the first approach to claim detection in the Bangla language, and it has the potential to be extended to other resource-constrained languages. Our work aims to support the fight against misinformation by improving the accuracy and transparency of information.
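The difference between the averaging and weighting strategies mentioned above can be illustrated with a minimal sketch. It assumes each base classifier outputs a probability that a sentence is a claim; the function names, the probabilities, and the weights below are hypothetical values chosen for illustration and are not taken from the paper, which does not state here how its weights were derived.

```python
from typing import Dict, List


def weighted_vote(probs: Dict[str, List[float]],
                  weights: Dict[str, float]) -> List[float]:
    """Combine per-classifier claim probabilities with a normalized weighted average."""
    total = sum(weights[name] for name in probs)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    n_sentences = len(next(iter(probs.values())))
    combined = []
    for i in range(n_sentences):
        score = sum(weights[name] * probs[name][i] for name in probs)
        combined.append(score / total)
    return combined


def to_labels(scores: List[float], threshold: float = 0.5) -> List[int]:
    """Map combined scores to labels: 1 = claim, 0 = not a claim."""
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical P(claim) for three sentences from each base classifier.
base_probs = {
    "svm": [0.9, 0.7, 0.2],
    "bernoulli_nb": [0.6, 0.6, 0.1],
    "decision_tree": [1.0, 0.0, 0.0],
}

# Averaging strategy: every classifier contributes equally.
equal_weights = {name: 1.0 for name in base_probs}

# Weighting strategy: hypothetical weights that trust the SVM most.
unequal_weights = {"svm": 3.0, "bernoulli_nb": 2.0, "decision_tree": 1.0}

averaged_labels = to_labels(weighted_vote(base_probs, equal_weights))
weighted_labels = to_labels(weighted_vote(base_probs, unequal_weights))

print(averaged_labels)  # [1, 0, 0]
print(weighted_labels)  # [1, 1, 0]
```

In this toy input the second sentence is labeled differently by the two strategies: with equal weights the decision tree's confident "not a claim" vote pulls the average below 0.5, whereas the unequal weights let the two agreeing classifiers dominate.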