Abstract

The number of cases involving Child Sexual Abuse Material (CSAM) has increased dramatically in recent years, resulting in significant backlogs. To protect children in the suspect’s sphere of influence, immediate identification of self-produced CSAM among acquired CSAM is paramount. Currently, investigators often rely on an approach based on a simple metadata search. However, this approach faces scalability limitations for large cases and is ineffective against anti-forensic measures. Therefore, to address these problems, we bridge the gap between digital forensics and state-of-the-art data science clustering approaches. Our approach enables clustering of more than 130,000 images, which is eight times larger than previous achievements, using commodity hardware and within an hour with the ability to scale even further. In addition, we evaluate the effectiveness of our approach on seven publicly available forensic image databases, taking into account factors such as anti-forensic measures and social media post-processing. Our results show an excellent median clustering-precision (Homogeinity) of 0.92 on native images and a median clustering-recall (Completeness) of over 0.92 for each test set. Importantly, we provide full reproducibility using only publicly available algorithms, implementations, and image databases.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call