A machine learning approach for detecting customs fraud through unstructured data analysis in social media

Bundidth Dangsawang, Siranee Nuchitprasitchai

doi:10.1016/j.dajour.2024.100408

Abstract

Individuals who are not authorized as legitimate dealers sell goods and services through social media, costing governments tax and customs duty revenue. This study proposes a model called SHIELD for detecting these violations from unstructured social media data. The three-phase process involves collecting 2,373,570 records of commercial goods from social media platforms such as Twitter and Facebook. In Phase 1, keywords are collected to label the text for classification. Three result categories are defined: Red Line for smuggled goods, goods with unpaid duty, prohibited goods, and restricted goods; Green Line for non-commercial goods; and Inspect for goods that cannot be identified from the text and require further investigation. Phases 2 and 3 use the keywords to label the unstructured social media data and detect smugglers, and three algorithms, Logistic Regression (LR), Gated Recurrent Unit (GRU), and Long Short-Term Memory (LSTM), are employed to classify illegally imported products. Across all tests, LSTM achieved the best accuracy (99.44%) and the best average F1 score (90.55%). These results demonstrate the potential of machine learning and natural language processing for detecting illegal activities and promoting economic security.
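As an illustration of the keyword-based labeling idea behind the three categories, here is a minimal Python sketch. It is not the SHIELD implementation: the keyword lists, the function name `label_post`, and the rule that Red Line keywords take precedence over Green Line keywords are all assumptions made for this example.

```python
# Hypothetical keyword lists, invented for illustration only.
# The keywords actually used in the study are not given in the abstract.
RED_KEYWORDS = {"no tax", "duty free", "smuggled"}
GREEN_KEYWORDS = {"not for sale", "giveaway", "personal use"}


def label_post(text: str) -> str:
    """Assign one of the three categories by simple substring matching.

    Assumed precedence: Red Line keywords are checked first, then
    Green Line keywords; anything unmatched falls back to Inspect.
    """
    lowered = text.lower()
    if any(keyword in lowered for keyword in RED_KEYWORDS):
        return "Red Line"
    if any(keyword in lowered for keyword in GREEN_KEYWORDS):
        return "Green Line"
    return "Inspect"


# Made-up example posts, not records from the study's data set.
posts = [
    "Imported cigarettes, NO TAX, DM to order",
    "Old books, not for sale, just sharing",
    "New arrivals this week",
]
labels = [label_post(post) for post in posts]
```

Labels produced this way could then serve as training targets for a text classifier such as LR, GRU, or LSTM; posts that match no keyword stay in the Inspect category for further review.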

Full Text