Finding Dutch natives in online forums

Bernard Van Den Boom, Cor J Veenman

doi:10.1080/20961790.2018.1482042

Abstract

Law enforcement agencies can exercise their powers only within a restricted area, their jurisdiction. These restrictions also apply to the Internet. On the Internet, however, the physical borders of a jurisdiction, typically country borders, are hard to discover. In our case, it is hard to establish whether someone involved in criminal online behavior is indeed a Dutch citizen. We propose a way to overcome the arduous task of manually investigating whether a user on an Internet forum is Dutch. More precisely, we aim to detect whether a given English text was written by a native Dutch author. To develop a detector, we follow a machine learning approach, which requires a dedicated training corpus. To obtain a corpus that is representative of online forums, we collected a large number of English forum posts by Dutch and non-Dutch authors on Reddit. To learn a detection model, we used a bag-of-words representation to capture potential misspellings, grammatical errors, or unusual turns of phrase that are characteristic of the authors' mother tongue. For this learning task, we compare the linear support vector machine and regularized logistic regression using the performance metrics F1 score, precision, and average precision. Our results show that logistic regression with frequency-based feature selection performs best at predicting Dutch natives. Further study should address the general applicability of the results, that is, whether the developed models carry over to other forums with comparably high performance.
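The bag-of-words representation and frequency-based feature selection mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how the selection was done, so this sketch assumes it means keeping the k terms that occur in the most posts. The tokenizer, the function names, and the toy posts are all hypothetical and are not taken from the paper.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a post and split it into word tokens (deliberately simple)."""
    return re.findall(r"[a-z']+", text.lower())


def select_vocabulary(posts, k):
    """Keep the k terms that occur in the most posts (document frequency).

    Assumption: this is one plausible reading of "frequency-based feature
    selection". Ties are broken alphabetically to keep the result deterministic.
    """
    doc_freq = Counter()
    for post in posts:
        doc_freq.update(set(tokenize(post)))
    ranked = sorted(doc_freq.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def bag_of_words(post, vocabulary):
    """Map a post to a vector of term counts over the selected vocabulary."""
    counts = Counter(tokenize(post))
    return [counts[term] for term in vocabulary]


# Hypothetical toy posts, invented for illustration only.
posts = [
    "I make a photo of the dog",
    "I take a photo of the cat",
    "the dog is on the photo",
]
vocabulary = select_vocabulary(posts, k=3)
vectors = [bag_of_words(post, vocabulary) for post in posts]
```

Vectors of this kind would then be passed to a classifier such as regularized logistic regression or a linear support vector machine. Word order is discarded, so any native-language signal has to come from which terms appear and how often.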

Highlights

The police and intelligence agencies undoubtedly struggle with the massive amount of textual content that is posted online, some of which is criminal in nature.
Some of the most important features after running logistic regression include terms that clearly increase the chance that a user is Dutch; the support vector machine (SVM) shows similar results.
F1 scores are similar across the choices of feature selection method (0.750).
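For readers unfamiliar with the metrics named above, the following sketch shows how precision, recall, F1 score, and average precision can be computed for a binary "Dutch native" label. It uses the common ranking-based definition of average precision, and the labels and scores are invented toy values, not results from the paper.

```python
def precision_recall_f1(y_true, y_pred):
    """Compute precision, recall and F1 for binary labels (1 = positive class)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def average_precision(y_true, scores):
    """Average the precision measured at the rank of each positive example.

    Examples are ranked by descending classifier score.
    """
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    hits = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


# Invented toy data: 1 = Dutch native, 0 = not Dutch.
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.4, 0.8, 0.2, 0.7, 0.1]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
ap = average_precision(y_true, scores)
```

F1 is the harmonic mean of precision and recall, so it is high only when both are high. Average precision summarizes the ranking quality over all decision thresholds instead of a single one.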

Summary

Introduction

The police and intelligence agencies undoubtedly struggle with the massive amount of textual content that is posted online, some of which is criminal in nature. Searching the whole web for this type of posting is a daunting task. We are especially interested in content posted on the so-called dark web, which is more often criminal in nature. The Dutch law enforcement agencies cannot follow up on users involved in criminal online activities who are outside their jurisdiction. Intelligence agencies face the sheer scale of the Internet, where criminal content comes from users of a wide variety of nationalities and is only partially relevant to them. A system to support the identification of Dutch citizens among web users is urgently needed.

Journal: Forensic Sciences Research
Publication Date: Jul 3, 2018
License type: open-access
