The most powerful approach to detect distant homologues of a protein is based on structure prediction and comparison. Yet this approach is still inapplicable to many viral proteins. Therefore, we applied a powerful sequence-based procedure to identify distant homologues of viral proteins. It relies on three principles: (1) traces of sequence similarity can persist beyond the significance cutoff of homology detection programmes; (2) candidate homologues can be identified among proteins with weak sequence similarity to the query by using 'contextual' information, e.g. taxonomy or type of host infected; (3) these candidate homologues can be validated using highly sensitive profile-profile comparison. As a test case, this approach was applied to a protein without known homologues, encoded by ORF4 of Lake Sinai viruses (which infect bees). We discovered that the ORF4 protein contains a domain that has homologues in proteins from >20 taxa of viruses infecting arthropods. We called this domain 'widespread, intriguing, versatile' (WIV), because it is found in proteins with a wide variety of functions and within varied domain contexts. For example, WIV is found in the NSs protein of tospoviruses, a global threat to food security, which infect plants as well as their arthropod vectors; in the RNA2 ORF1-encoded protein of chronic bee paralysis virus, a widespread virus of bees; and in various proteins of cypoviruses, which infect the silkworm Bombyx mori. Structural modelling with AlphaFold indicated that the WIV domain has a previously unknown fold, and bibliographical evidence suggests that it facilitates infection of arthropods.
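Principles (1) and (2) above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Hit` record, the `candidate_homologues` function, the E-value thresholds and the `host_group` labels are all hypothetical names and values chosen for illustration. The sketch keeps hits that fall beyond the significance cutoff but still share the query's context. Principle (3), validation by profile-profile comparison, is left to an external tool and is not shown.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One sequence-search hit (hypothetical record, for illustration only)."""
    subject_id: str
    evalue: float
    host_group: str  # contextual label, e.g. "arthropod" or "plant"


def candidate_homologues(hits, significance_cutoff=1e-3, max_evalue=10.0,
                         context="arthropod"):
    """Select weak hits that share the query's context.

    Principle (1): keep hits whose E-value is worse than the significance
    cutoff but still below a loose upper bound (both values are assumptions).
    Principle (2): of those, keep only hits whose host group matches the
    context of the query.
    The returned candidates still need validation by profile-profile
    comparison, which this sketch does not perform.
    """
    candidates = [
        h for h in hits
        if significance_cutoff < h.evalue <= max_evalue
        and h.host_group == context
    ]
    # Strongest remaining similarity first.
    return sorted(candidates, key=lambda h: h.evalue)
```

Given a list of hits for the query, calling `candidate_homologues(hits)` returns only the sub-significant hits from arthropod-infecting viruses, ordered by E-value, as a shortlist for the validation step.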