Abstract

Lateral genetic transfer (LGT) plays an important role in the evolution of microbes. Existing computational methods for detecting genomic regions of putative lateral origin scale poorly to large datasets. Here, we propose a novel method based on TF-IDF (Term Frequency-Inverse Document Frequency) statistics to detect not only regions of lateral origin, but also their origin and direction of transfer, in sets of hierarchically structured nucleotide or protein sequences. This approach is based on the frequency distributions of k-mers in the sequences. If a set of contiguous k-mers in a sequence appears sufficiently more frequently in another phyletic group than in the sequence's own group, we infer that the region has been transferred from that other group into the sequence's own group. We performed rigorous tests of TF-IDF using simulated and empirical datasets. With the simulated data, we tested our method under different parameter settings for sequence length, substitution rate (between groups, within groups and post-LGT), deletion rate, length of transferred region and k-mer size (k), and found that we can detect LGT events with high precision and recall. Our method performs better than an established method, ALFY, which has high recall but low precision. Our method is efficient, with runtime increasing approximately linearly with sequence length.
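The k-mer comparison described above can be illustrated with a minimal Python sketch. It is not the published implementation: the function names (`group_tfidf`, `detect_regions`), the thresholds `ratio` and `min_kmers`, and the textbook form of TF-IDF used here (relative k-mer frequency within a group multiplied by the log inverse fraction of groups containing the k-mer) are all assumptions made for illustration, and the homopolymer sequences are toy data.

```python
import math
from collections import Counter


def kmers(seq, k):
    """Return the overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def group_tfidf(groups, k):
    """Map each group name to {k-mer: TF * IDF}.

    Assumption: TF is the k-mer's relative frequency within the group and
    IDF is log(number of groups / number of groups containing the k-mer).
    """
    tf = {g: Counter(km for s in seqs for km in kmers(s, k))
          for g, seqs in groups.items()}
    df = Counter(km for counts in tf.values() for km in counts)
    n = len(groups)
    scores = {}
    for g, counts in tf.items():
        total = sum(counts.values())
        scores[g] = {km: (c / total) * math.log(n / df[km])
                     for km, c in counts.items()}
    return scores


def detect_regions(query, own, scores, k, ratio=2.0, min_kmers=3):
    """Return (start, end, donor) for runs of at least min_kmers contiguous
    k-mers that score at least `ratio` times higher in one other group than
    in the query's own group. `end` is exclusive. Needs >= 2 groups.
    """
    regions = []
    run_start, run_donor = None, None
    for i, km in enumerate(kmers(query, k) + [None]):  # None flushes the last run
        donor = None
        if km is not None:
            own_score = scores[own].get(km, 0.0)
            best_score, best = max((sc.get(km, 0.0), g)
                                   for g, sc in scores.items() if g != own)
            if best_score > 0 and best_score >= ratio * own_score:
                donor = best
        if donor != run_donor:
            if run_donor is not None and i - run_start >= min_kmers:
                regions.append((run_start, i - 1 + k, run_donor))
            run_start, run_donor = i, donor
    return regions


# Toy data (hypothetical): group B's sequence has been inserted into one
# member of group A at positions 6..13.
query = "A" * 6 + "C" * 8 + "A" * 6
groups = {
    "A": ["A" * 20, "A" * 20, query],
    "B": ["C" * 20] * 3,
    "C": ["G" * 20] * 3,
}
scores = group_tfidf(groups, k=4)
regions = detect_regions(query, "A", scores, k=4)
print(regions)  # [(6, 14, 'B')]
```

In this sketch the IDF factor is identical for a given k-mer in every group, so it mainly suppresses k-mers shared by all groups (IDF of zero), while the TF ratio decides which group is the putative donor.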

Highlights

  • Many microbes can acquire DNA from their environment and incorporate it into their genome via processes of lateral genetic transfer (LGT; also known as horizontal gene transfer, HGT)[1]

  • We find that with appropriate parameter values, the algorithm performs with good precision and recall; runtime increases approximately linearly with sequence length, and in most cases TF-IDF performs much better than ALFY[15]

  • Since TF-IDF does not detect LGT between sequences within a group, for the comparison we ignore such regions that are detected by ALFY; if an atypical region is predicted in several sequences of potential donor groups, we treat this result as a single prediction for the calculation of precision and recall


Summary

Introduction

Many microbes can acquire DNA from their environment and incorporate it into their genome via processes of lateral genetic transfer (LGT; also known as horizontal gene transfer, HGT)[1]. ALFY compares the generated tree with a reference, inferring topological incongruence as instances of LGT. Another alignment-free method for LGT detection is based on the so-called purity measure[19]. If most substrings of string x, which is itself a substring of string T, appear with the same frequency as x, the purity value of x is high; that is, subpatterns in x occur infrequently in T outside whole occurrences of x, as would be expected if x had arisen by LGT. Both of these alignment-free methods use suffix trees[20] for scalability on large sequence datasets. They consider only one target sequence (ALFY incorporates a pairwise comparison between query and multiple subject sequences) and do not take into account any natural group structure of the dataset, whether taxonomic (a hierarchy of species, genera etc.), ecological or otherwise.
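To make the purity idea concrete, the following Python sketch scores a string x by the fraction of its proper substrings that occur in T exactly as often as x itself. This is a simplified reading of the description above, not the definition published in [19]: the function names, the `min_len` cutoff and the brute-force counting (in place of suffix trees) are assumptions chosen for illustration.

```python
def count_occurrences(pattern, text):
    """Count overlapping occurrences of pattern in text."""
    return sum(1 for i in range(len(text) - len(pattern) + 1)
               if text.startswith(pattern, i))


def purity(x, text, min_len=2):
    """Fraction of distinct proper substrings of x (length >= min_len)
    that occur in text exactly as often as x does.

    Assumption: a simplified stand-in for the purity measure, computed by
    brute force rather than with a suffix tree.
    """
    fx = count_occurrences(x, text)
    subs = {x[i:j] for i in range(len(x))
            for j in range(i + min_len, len(x) + 1)}
    subs.discard(x)  # keep proper substrings only
    if fx == 0 or not subs:
        return 0.0
    same = sum(1 for s in subs if count_occurrences(s, text) == fx)
    return same / len(subs)


# Toy data (hypothetical): "GATC" is embedded once in a run of T's.
text = "TTTTTT" + "GATC" + "TTTTTT"
print(purity("GATC", text))  # 1.0: every substring occurs only inside "GATC"
print(purity("TTTT", text))  # 0.0: "TT" and "TTT" occur more often than "TTTT"
```

A high value for "GATC" reflects that its subpatterns never occur in the text outside whole occurrences of "GATC", which is the signature the passage above associates with a laterally acquired region.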

