Abstract

The likelihood ratio paradigm for quantifying the strength of evidence has been researched in many fields of forensic science. Within this paradigm, score-based approaches to estimating likelihood ratios are becoming more prevalent in the forensic science literature. In this study, a score-based approach to estimating likelihood ratios is implemented for linguistic text evidence. Text data are represented with a bag-of-words model using the Z-score normalised relative frequencies of the N most frequent words, and the Euclidean, Manhattan and Cosine distance measures are trialled as the score-generating functions for comparing paired text samples. The score-to-likelihood-ratio conversion model was built using a common-source method, and the best-fitting model was selected from parametric models based on the Normal, Log-normal, Gamma and Weibull distributions. From the Amazon Product Data Authorship Verification Corpus, two groups of documents (each group containing documents of approximately 700, 1400 and 2100 words) were synthesised for each author, allowing 720 same-author comparisons and 517,680 different-author comparisons for testing the validity of the system. A series of experiments was conducted using combinations of the following conditions: the three score functions, different values of N for the feature vector and the different document lengths. The validity of the system was assessed using the log-likelihood-ratio cost (Cllr), and the strength of the derived likelihood ratios was charted in the form of Tippett plots. It was demonstrated that 1) the Cosine measure consistently outperforms the other measures, with the best performance achieved at N = 260 regardless of document length (e.g., Cllr values of 0.70640, 0.45314 and 0.30692 for 700, 1400 and 2100 words, respectively), and 2) the derived likelihood ratios are very well calibrated irrespective of the distance measures and document lengths. A follow-up experiment showed that the described score-based approach is relatively robust and stable even when the quantity of background data is limited. The likelihood ratios estimated separately with the three distance measures were then fused by logistic regression, which further improved performance (e.g., a Cllr of 0.23494 for 2100 words). This study demonstrates the possibility of designing likelihood ratio–based systems that discriminate between same-author and different-author documents.
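
As a rough illustration of the pipeline summarised above, the following Python sketch shows Z-score normalised relative-frequency features over the N most frequent words, the three candidate distance measures used as score functions, and the Cllr metric used to assess validity. This is not the authors' code; the function names and the toy documents are hypothetical, and the score-to-likelihood-ratio conversion and logistic-regression fusion steps are omitted.

# Minimal sketch (not the paper's implementation): features, score functions and Cllr.
import math
from collections import Counter

def top_n_words(reference_corpus, n):
    """Return the n most frequent word types in a list of tokenised documents."""
    counts = Counter(token for doc in reference_corpus for token in doc)
    return [word for word, _ in counts.most_common(n)]

def relative_frequencies(doc, vocabulary):
    """Relative frequency of each vocabulary word in one tokenised document."""
    counts = Counter(doc)
    total = len(doc)
    return [counts[w] / total for w in vocabulary]

def zscore_normalise(vectors):
    """Z-score each feature dimension across a collection of feature vectors."""
    dims = len(vectors[0])
    means = [sum(v[d] for v in vectors) / len(vectors) for d in range(dims)]
    stds = [math.sqrt(sum((v[d] - means[d]) ** 2 for v in vectors) / len(vectors)) or 1.0
            for d in range(dims)]
    return [[(v[d] - means[d]) / stds[d] for d in range(dims)] for v in vectors]

# Candidate score-generating functions for a pair of feature vectors.
def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def cosine_distance(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / norm if norm else 1.0

def cllr(same_author_lrs, different_author_lrs):
    """Log-likelihood-ratio cost: lower values indicate better-calibrated, more valid LRs."""
    penalty_same = sum(math.log2(1 + 1 / lr) for lr in same_author_lrs) / len(same_author_lrs)
    penalty_diff = sum(math.log2(1 + lr) for lr in different_author_lrs) / len(different_author_lrs)
    return 0.5 * (penalty_same + penalty_diff)

# Toy usage example with made-up tokenised documents:
docs = [["the", "cat", "sat"], ["the", "dog", "ran", "the", "dog"]]
vocab = top_n_words(docs, 3)
feats = zscore_normalise([relative_frequencies(d, vocab) for d in docs])
print(cosine_distance(feats[0], feats[1]))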
