Similarity assessment for removal of noisy end user license agreements

Niklas Lavesson, Stefan Axelsson

doi:10.1007/s10115-011-0438-9

Open Access

https://doi.org/10.1007/s10115-011-0438-9

Abstract

In previous work, we showed that it is possible to automatically discriminate between legitimate software and spyware-associated software by applying supervised learning to end user license agreements (EULAs). However, the number of false positives (spyware classified as legitimate software) was too large for practical use. In this study, the false positives problem is addressed by removing noisy EULAs, which are identified by performing similarity analysis of the previously studied EULAs. Two candidate similarity analysis methods are compared experimentally for this purpose: cosine similarity assessment in conjunction with latent semantic analysis (LSA), and normalized compression distance (NCD). The results show that the number of false positives can be reduced significantly by removing noise identified by either method. However, the experimental results also indicate subtle performance differences between LSA and NCD. To improve performance further and to decrease the large number of attributes, the categorical proportional difference (CPD) feature selection algorithm was applied. CPD greatly reduced the number of attributes while also increasing classification performance on the original data set, as well as on the LSA- and NCD-based data sets.
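To make the NCD measure named in the abstract concrete, here is a minimal sketch of the standard normalized compression distance, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C is the compressed length. The abstract does not say which compressor the authors used, so the use of `zlib` here is an assumption, and the function names `compressed_size` and `ncd` are illustrative only.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings.

    Values near 0 suggest the inputs share most of their content;
    values near 1 suggest they share little.
    """
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


if __name__ == "__main__":
    # Hypothetical EULA fragments, used only to exercise the function.
    eula_a = b"You may install and use one copy of the software. " * 20
    eula_b = b"We may collect and share browsing data with partners. " * 20
    print(ncd(eula_a, eula_a), ncd(eula_a, eula_b))
```

Under this reading, a EULA whose distance to the other documents in its class is unusually large could be treated as a noise candidate, although the exact removal criterion used in the study is not given in the abstract.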

Journal: Knowledge and Information Systems
Publication Date: Jul 28, 2011
Citations: 45
License type: pd

Similar Papers

Learning to detect spyware using end user license agreements
Niklas Lavesson ... Martin Boldt
Knowledge and Information Systems | VOL. 26
16 Jan 2010

Automated Spyware Detection Using End User License Agreements
Martin Boldt ... Niklas Lavesson
01 Apr 2008

Latent semantic indexing (LSI) fails for TREC collections
Avinash Atreya ... Charles Elkan
ACM SIGKDD Explorations Newsletter | VOL. 12
31 Mar 2011

Privacy-Invasive Software and Preventive Mechanisms
Martin Boldt ... Bengt Carlsson
01 Oct 2006
