Abstract
In this work, we study the first passage statistics of amino acid primary sequences, that is, the probability of observing an amino acid for the first time at a certain number of residues away from a fixed amino acid. By using this rich mathematical framework, we are able to capture the background distribution for an organism, and infer lengths at which the first passage has a probability that differs from what is expected. While many features of an organism's genome are due to natural selection, others are related to amino acid chemistry and the environment in which an organism lives, constraining the randomness of genomes upon which selection can further act. We therefore use this approach to infer amino acid correlations, and then study how these correlations vary across a wide range of organisms spanning a wide range of optimal growth temperatures. We find a nearly universal exponential background distribution, consistent with the idea that most amino acids are globally uncorrelated with other amino acids in genomes. When we are able to extract significant correlations, these correlations are reliably dependent on optimal growth temperature, across phylogenetic boundaries. Some of the correlations we extract, such as the enhanced probability of finding, for the first time, a cysteine three residues away from a cysteine or a glutamic acid two residues away from an arginine, likely relate to thermal stability. However, other correlations, likely appearing on alpha helical surfaces, have a less clear physicochemical interpretation and may relate to thermal stability or unusual metabolic properties of organisms that live in high-temperature environments.
Highlights
First passage statistics provide a natural mathematical framework for analyzing the likelihood of the first occurrence of an event after some initial event [1]
Several authors have noted the effects of optimal growth temperature (OGT) on various amino acid features [4,5,6,7,8,9]
First passage statistics were chosen to take into account the OGT dependence of amino acids in the background distribution, while possessing the statistical power to resolve discontiguous correlations over large lengths
Summary
First passage statistics provide a natural mathematical framework for analyzing the likelihood of the first occurrence of an event after some initial event [1]. What is the best method, both in terms of the underlying mathematics and empirical application to sequence datasets, to infer the set of amino acid correlations in proteins that depend on the environment in which they function? We observe a universal exponential background distribution for amino acid first passage distributions across a set of 76 organisms, chosen to represent the range of well-characterized optimal growth temperatures among organisms with fully sequenced proteomes.
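To make the first passage definition concrete, the following is a minimal Python sketch and not the authors' implementation. It counts, for each occurrence of a starting residue, the distance to the first later occurrence of a target residue, and compares the resulting distribution with a geometric background, the discrete analogue of an exponential. The function names, the toy sequence, and the null model (residues drawn independently with the target's overall frequency) are all assumptions made for illustration; the null model actually used in the paper may differ.

```python
from collections import Counter


def first_passage_counts(seq, start, target):
    """Count distances from each `start` residue to the first later `target`."""
    counts = Counter()
    pending = []  # positions of `start` that have not yet met a `target`
    for i, residue in enumerate(seq):
        if residue == target:
            # Every waiting start position sees its first target here.
            for p in pending:
                counts[i - p] += 1
            pending.clear()
        if residue == start:
            pending.append(i)
    return counts


def first_passage_distribution(seq, start, target):
    """Normalize the first passage counts into empirical probabilities."""
    counts = first_passage_counts(seq, start, target)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {k: n / total for k, n in sorted(counts.items())}


def geometric_background(freq, k):
    """Assumed null model: independent residues, target frequency `freq`."""
    return freq * (1.0 - freq) ** (k - 1)


if __name__ == "__main__":
    toy = "MCAACRAEKRDELCGGC"  # hypothetical toy sequence, not real data
    start, target = "C", "C"
    freq = toy.count(target) / len(toy)
    observed = first_passage_distribution(toy, start, target)
    for k, p in observed.items():
        print(k, round(p, 3), round(geometric_background(freq, k), 3))
```

On a real proteome, distances at which the observed probability departs clearly from the background would be candidates for correlations such as the cysteine-to-cysteine and arginine-to-glutamic-acid examples mentioned in the abstract. Deciding whether a departure is significant needs a statistical test that this sketch leaves out.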