Genomic islands are fragments of foreign DNA found in bacterial and archaeal genomes and are typically associated with symbiosis or pathogenesis. Although numerous genomic island detection methods have been proposed, the efficiency of the underlying genome information processing and boundary recognition tools has received limited evaluation. In this study, we reviewed the statistical methods used in five steps of genomic island prediction: genomic signatures, host signature extraction, informative signature selection, divergence measures, and boundary detection. We compared the performance of these methods in simulated experiments using alien fragments obtained from both artificial and real genomes. Among the nine genomic signatures evaluated, genomic signature frequency and full probability performed best; performance declined for signatures normalized by their expectations and variances, such as Z-score and composition vector. In our experiments on the E. coli genome, confidence intervals of the window variances achieved the best performance in host signature extraction, with the best confidence interval being 1.5–2 times the standard error. Ordered kurtosis was the most effective at selecting informative signatures from a single genome, without requiring prior knowledge from other datasets. Among the three divergence measures evaluated, the two-sample t-test was the most successful, and a non-overlapping window with a small eye window (size 2) was best suited to identifying compositionally distinct regions. Finally, taking the maximum of the Markovian Jensen-Shannon divergence score, in terms of GC-content bias, made boundary detection faster while maintaining a similar error rate.
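To make the pipeline steps named above more concrete, the following is a minimal sketch of a frequency-based genomic signature scored in non-overlapping windows against a host signature. It is an illustration under stated assumptions, not the method evaluated in the study: the function names, the dinucleotide signature (k = 2), the window size, and the use of the whole genome as the host signature are all hypothetical choices, and the plain Jensen-Shannon divergence shown here stands in for the Markovian variant and the two-sample t-test mentioned in the abstract.

```python
import math
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=2):
    """Relative frequency of every k-mer over the ACGT alphabet in seq."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[km] for km in kmers)
    if total == 0:
        # Sequence too short (or no ACGT k-mers): return an all-zero signature.
        return {km: 0.0 for km in kmers}
    return {km: counts[km] / total for km in kmers}


def jensen_shannon(p, q):
    """Plain (non-Markovian) Jensen-Shannon divergence in bits."""
    def kl(a, m):
        # Terms with a[x] == 0 contribute nothing; m[x] > 0 whenever a[x] > 0.
        return sum(a[x] * math.log2(a[x] / m[x]) for x in a if a[x] > 0)

    m = {x: 0.5 * (p[x] + q[x]) for x in p}
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def window_scores(genome, window=1000, k=2):
    """Score each non-overlapping window against the whole-genome signature.

    Assumption: the whole genome approximates the host signature. The study
    compares dedicated host signature extraction methods instead.
    """
    host = kmer_frequencies(genome, k)
    scores = []
    for start in range(0, len(genome) - window + 1, window):
        signature = kmer_frequencies(genome[start:start + window], k)
        scores.append((start, jensen_shannon(signature, host)))
    return scores
```

Windows whose score is high relative to the rest of the genome are candidates for compositionally distinct regions; how to set that threshold and how to refine the window edges into boundaries are the questions the later steps of the pipeline address.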