Abstract

The symbolic nature of biological sequence data greatly complicates its analysis. As many of the most powerful, widely used analysis techniques work exclusively in the context of Euclidean spaces, methods of embedding DNA, RNA, and protein sequences in such spaces are of critical importance. Here we examine the utility of multilateration as the foundation for representing biological sequences in Euclidean space. With respect to discrete metric spaces, this technique is closely related to the concept of metric dimension from graph theory, where the goal is to discover a subset of vertices of minimum size such that all vertices in the graph may be uniquely identified based on their distances to the vertices in this set. Multilateration is analogous to trilateration, the process of identifying points in the plane using distances to three non-collinear points. Interpreting the space of all k-mers as a Hamming graph, we are able to find such sets efficiently. The resulting sequence representations tend to be more compact than traditional binary or k-mer count vectors and, unlike Multidimensional Scaling (MDS) and Node2Vec, they apply over all k-mers and do not need to be recomputed when new data is encountered. To test the efficacy and practicality of multilateration, we classify DNA $20$-mers centered at intron-exon boundaries in the Drosophila melanogaster genome using features derived from binary and k-mer count vectors as well as MDS, Node2Vec, and multilateration. The performance of multilateration-based features is competitive with other techniques, and multilateration allows long genomic sequences to be embedded efficiently. This highlight showcases the key findings in Low-dimensional representation of genomic sequences [J Math Biol. 2019 Jul;79(1):1-29].
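As a rough illustration of the idea only, and not the efficient construction the paper refers to, the sketch below treats k-mers over the DNA alphabet as vertices of a Hamming graph, embeds each k-mer as its vector of Hamming distances to a set of landmark k-mers, and checks by brute force whether that set identifies every k-mer uniquely. The names `ALPHABET`, `hamming`, `embed`, `is_resolving`, and `smallest_resolving_set` are hypothetical, and the exhaustive search is assumed to be practical only for very small k.

```python
from itertools import combinations, product

ALPHABET = "ACGT"  # assumption: DNA alphabet; any finite alphabet would work


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def embed(kmer, landmarks):
    """Represent a k-mer by its Hamming distances to each landmark k-mer."""
    return tuple(hamming(kmer, r) for r in landmarks)


def is_resolving(landmarks, k):
    """True if every k-mer gets a distinct distance vector (brute force)."""
    seen = set()
    for chars in product(ALPHABET, repeat=k):
        vec = embed("".join(chars), landmarks)
        if vec in seen:
            return False  # two k-mers share the same distances
        seen.add(vec)
    return True


def smallest_resolving_set(k):
    """Exhaustive search for a minimum-size resolving set; tiny k only."""
    kmers = ["".join(c) for c in product(ALPHABET, repeat=k)]
    for size in range(1, len(kmers) + 1):
        for candidate in combinations(kmers, size):
            if is_resolving(candidate, k):
                return candidate


if __name__ == "__main__":
    landmarks = smallest_resolving_set(2)
    print(landmarks, embed("GT", landmarks))
```

Once a resolving set is fixed, the same landmarks embed any k-mer of that length, which is why such a representation does not have to be recomputed when new sequences arrive.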

