Empirical evaluations of language-based author identification techniques

Carole E Chaski

doi:10.1558/sll.2001.8.1.1

Abstract

Recent court decisions in the United States call for the empirical testing of language-based author identification techniques. This article reports the results of such testing. The hypotheses tested include syntactic analysis, syntactically-classified punctuation, sentential complexity, vocabulary richness, readability, content analysis, spelling errors, punctuation errors, word form errors, and grammatical errors. These hypotheses are tested on a set of documents written by four women who are similar in age, educational level, and dialectal background; two of the women are Euro-American and two are Afro-American. Each hypothesis is tested separately to determine its ability both to differentiate documents written by different authors and to cluster documents written by the same author. Hypotheses that quantify linguistic features are tested statistically using the chi-square statistic, and discrimination error rates are calculated. Only two hypotheses successfully differentiate and cluster documents: syntactic analysis and syntactically-classified punctuation.
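As a rough illustration of the kind of chi-square comparison mentioned in the abstract, the following Python sketch computes a Pearson chi-square statistic for feature counts from two documents. It is a minimal sketch under stated assumptions, not the article's actual procedure: the function name `chi_square_statistic`, the three feature categories, and all counts are hypothetical and do not come from the study's data.

```python
def chi_square_statistic(table):
    """Return the Pearson chi-square statistic and degrees of freedom.

    `table` is a list of rows of observed counts (one row per document,
    one column per feature category). Assumes every row and column total
    is nonzero, so that no expected count is zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the feature distribution did not depend on the document.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Hypothetical counts of three feature categories in two documents
# (invented for illustration only).
doc_a = [30, 10, 20]
doc_b = [10, 30, 20]

stat, dof = chi_square_statistic([doc_a, doc_b])

# Standard critical value of the chi-square distribution for 2 degrees
# of freedom at the 0.05 significance level.
CRITICAL_05_DF2 = 5.991
differs = stat > CRITICAL_05_DF2
```

With these invented counts, a statistic above the critical value would suggest that the two documents distribute the features differently; whether such a difference reflects different authorship is the empirical question the article tests.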

Full Text