On average sequence complexity

Svante Janson,Stefano Lonardi,Wojciech Szpankowski

doi:10.1016/j.tcs.2004.06.023

Abstract

In this paper we study the average behavior of the number of distinct substrings in a text of size n over an alphabet of cardinality k. This quantity is called the complexity index and it captures the “richness of the language” used in a sequence. For example, sequences with low complexity index contain a large number of repeated substrings and they eventually become periodic (e.g., tandem repeats in a DNA sequence). In order to identify unusually low- or high-complexity strings one needs to determine how far are the complexities of the strings under study from the average or maximum string complexity. While the maximum string complexity was studied quite extensively in the past, to the best of our knowledge there are no results concerning the average complexity. We first prove that for a sequence generated by a mixing model (which includes Markov sources) the average complexity is asymptotically equal to n 2 / 2 which coincides with the maximum string complexity. However, for a memoryless source we establish a more precise result, namely the average string complexity is n 2 / 2 - n log k n + 1 + ( 1 - γ ) / ln k + φ k ( log k n ) + o ( 1 ) n where γ ≈ 0.577 and φ k ( x ) is a periodic function with a small amplitude for small alphabet size.

Full Text