Discovery of patterns in LZ-78 text discrimination

Mikhail Malyutov,Gabriel Cunningham

doi:10.1109/sibircon.2010.5555303

Abstract

We continue studying a new context-free computationally simple stylometry-based text homogeneity test: the sliced conditional compression complexity (sCCC or simply CCC) of literary texts introduced and inspired by the incomputable Kolmogorov conditional complexity. Other stylometry tools can occasionally almost coincide for different authors. Our CCC-attributor is asymptotically strictly minimal for the true author, if the query texts are sufficiently large but much less than the training texts, universal compressor is good and sampling bias is avoided. This classifier simplifies the homogeneity test (partly based on compression) under insignificant difference of unconditional complexities of training and query texts verifiable via its asymptotic normality for IID and Markov sources and empirical normal plots for real literary texts. It is consistent under large text approximation as a stationary ergodic sequence due to the lower bound for the minimax compression redundancy of piecewise stationary strings, our elementary combinatorial arguments and simulation in present paper for IID sources. The CCC is based on the t-ratio measuring how many standard deviations are in the mean difference of slices' CCC which enables evaluation of the corresponding P-value of statistical significance based on slices' CCC asymptotic normality empirically verified by their normal plots in all cases studied and expected to be proved soon for simplified statistical models of literary texts. While good CCC performance holds for a broad class of fast adapting universal compressors introduced, the new application of one of them (LZ-78) is described in this paper (which was first outlined in less detail and without application results): the discovery of patterns contributed the most to the CCC discrimination of texts. We present an algorithm implemented in a PERL program by the second author (available by request) and results of its performance on simulated strings and on Federalist papers. A much shorter CCC program in PERL written by the second author is in our Appendix. We hope that experimenting with these two programs will help the CCC dissemination. The asymptotic CCC study was complemented so far by many literary case studies, showing consistency in attributing all Madison Federalist papers to Madison, Brodsky's novels to Brodsky, etc., and a good discrimination between works of different authors in contrast to at least questionable performance of on the same texts. Further testing and tuning CCC-methodology and pattern discovery seems appropriate. New areas of applications for the universal compressor - based nonparametric testing such as change-point detection for arbitrary stationary ergodic sources emerge recently.

Full Text