Abstract

Biological sequences from different species are called orthologs if they evolved from a sequence in a common ancestor species and have the same biological function. Approximations of the Kolmogorov complexity or entropy of biological sequences are known to be useful for extracting similarity information between such sequences, for example in ortholog detection. The exact Kolmogorov complexity is not algorithmically computable; in practice it is approximated by computable compression methods. However, such compression methods do not approximate Kolmogorov complexity well for short sequences. This paper suggests a new approach to overcome that problem, inspired by new, conditional computations of Kolmogorov entropy. A main contribution of the empirical work is to show that the new set of entropy-based machine learning attributes separates positive (ortholog) from negative (non-ortholog) data better than good, previously known alternatives, which do not employ a means to handle short sequences well. The new entropy-based attribute set is also compared empirically with a number of other, more standard similarity attribute sets commonly used in genomic analysis. The attribute sets are evaluated by cross-validation with boosted decision tree induction (C5.0) and by Receiver Operating Characteristic (ROC) analysis. The results indicate that the new entropy-based attribute set does not by itself give the best prediction; however, it is the best attribute set for improving the other, standard attribute sets when combined with them.

Keywords: compression, decision tree, entropy, ortholog, ROC.
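The general idea of approximating Kolmogorov complexity by compression, and its weakness on short inputs, can be illustrated with a minimal sketch. It assumes a general-purpose compressor (zlib) and the standard normalized compression distance; it is not the conditional entropy attribute set proposed in the paper. The helper names `compressed_size`, `ncd`, and `random_dna` are hypothetical and introduced only for illustration.

```python
import random
import zlib


def compressed_size(seq: bytes) -> int:
    """Length in bytes of the zlib-compressed input.

    Used here as a computable stand-in for Kolmogorov complexity,
    which itself cannot be computed.
    """
    return len(zlib.compress(seq, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two sequences.

    Values near 0 suggest the sequences share most of their information;
    values near 1 suggest they share little.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def random_dna(length: int, seed: int) -> bytes:
    """Synthetic DNA-like test data (an assumption, not real sequences)."""
    rng = random.Random(seed)
    return "".join(rng.choices("ACGT", k=length)).encode("ascii")


if __name__ == "__main__":
    a = random_dna(2000, seed=1)
    b = random_dna(2000, seed=2)

    # A sequence compared with itself versus with an unrelated sequence.
    print("ncd(a, a) =", ncd(a, a))
    print("ncd(a, b) =", ncd(a, b))

    # For a very short input, the compressor's fixed overhead dominates,
    # so the compressed size says little about the sequence itself.
    short = b"ACGT"
    print("raw length:", len(short), "compressed length:", compressed_size(short))
```

On long inputs the distance separates related from unrelated sequences, while on a four-letter input the compressed size exceeds the raw length, which is the short-sequence problem the abstract describes.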

