Abstract

With the prevalence of online harassment and abuse, social media and discussion platforms seek to curb toxic comments. Google's Perspective API aims to help platforms classify toxic comments automatically. We have created a pipeline that modifies toxic comments so they evade Perspective. The pipeline uses existing adversarial machine learning attacks to find an optimal perturbation that evades the model. Because these attacks typically target continuous image data rather than discrete text, we include a process that generates text candidates from the perturbed features and selects among them to retain syntactic similarity to the original comment. We demonstrate that, with a surrogate model built from just 10,000 queries, changing three words in each comment evades Perspective 25% of the time. This suggests that building a surrogate model may not require many queries, and that a more robust approach is needed to improve toxic comment classifier accuracy.
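To make the attack loop concrete, below is a minimal sketch of a greedy word-substitution search of the kind the abstract describes. Everything in it is hypothetical stand-in material, not the paper's implementation: surrogate_score is a toy lexicon scorer playing the role of a surrogate model built from Perspective queries, and CANDIDATES stands in for the text candidates the pipeline generates from perturbed features.

```python
# Minimal sketch, assuming a query-based surrogate and a fixed candidate
# table. All names here (surrogate_score, CANDIDATES, attack) are
# illustrative, not the paper's actual code.

from typing import List

# Hypothetical surrogate: a toy lexicon scorer standing in for a model
# trained from ~10,000 Perspective queries.
TOXIC_LEXICON = {"idiot": 0.9, "stupid": 0.8, "hate": 0.7}

def surrogate_score(tokens: List[str]) -> float:
    """Return an estimated toxicity score in [0, 1]."""
    scores = [TOXIC_LEXICON.get(t.lower(), 0.0) for t in tokens]
    return max(scores, default=0.0)

# Hypothetical substitution candidates; in the paper's pipeline these
# would be generated from perturbed feature vectors instead.
CANDIDATES = {
    "idiot": ["idi0t", "fool"],
    "stupid": ["st-upid", "silly"],
    "hate": ["h8", "dislike"],
}

def attack(comment: str, max_changes: int = 3, threshold: float = 0.5) -> str:
    """Greedily swap up to `max_changes` words to push the surrogate
    score below `threshold`, changing as little text as possible."""
    tokens = comment.split()
    for _ in range(max_changes):
        if surrogate_score(tokens) < threshold:
            break  # already evades the (surrogate) classifier
        best = None  # (score, position, replacement)
        for i, tok in enumerate(tokens):
            for sub in CANDIDATES.get(tok.lower(), []):
                trial = tokens[:i] + [sub] + tokens[i + 1:]
                s = surrogate_score(trial)
                if best is None or s < best[0]:
                    best = (s, i, sub)
        if best is None:
            break  # no candidate lowers the score further
        _, i, sub = best
        tokens[i] = sub
    return " ".join(tokens)

if __name__ == "__main__":
    original = "you are an idiot and I hate this"
    print(original, "->", attack(original))
```

In the paper's pipeline the candidates are derived from adversarial perturbations in feature space and filtered for syntactic similarity to the original comment; the toy search above simply picks, at each step, whichever substitution most lowers the surrogate score, stopping after three changes.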
