Beginning with the sequencing of the human genome in the 1990s, major efforts have been underway to expand our working knowledge of proteins. Between 2000 and 2015, the Protein Structure Initiative (PSI) solved more than 15,000 protein structures, but most of them have unknown or uncertain biochemical function, or an incorrectly assigned putative function. With so many structures available, methodology must be developed to annotate these proteins functionally in order to identify potential applications, such as bioremediation, and to improve understanding of cellular processes. Enzymes in the Haloacid Dehalogenase Superfamily (HADSF) possess a wide range of functions; they include phosphatases important in cell membrane biosynthesis and dehalogenases that can detoxify and degrade halogenated compounds. The Ondrechen Research Group (ORG) at Northeastern University has developed methodology that is being used to predict biochemical function for Structural Genomics (SG) proteins in the HADSF. These methods are Partial Order Optimum Likelihood (POOL) and Structurally Aligned Local Sites of Activity (SALSA). POOL is a machine learning method that predicts which residues are catalytically active or otherwise important for protein function. It draws on three kinds of input: electrostatic features and metrics from THEoretical Microscopic Anomalous TItration Curve Shapes (THEMATICS), ligand binding pocket and geometric features from ConCavity, and evolutionary scores and phylogenetic trees from INformation‐theoretic TREe traversal for Protein functional site IDentification (INTREPID). SALSA takes the functional residue predictions that POOL produces for SG proteins and aligns them with consensus signatures from known enzyme subfamilies according to the local spatial arrangement of the predicted residues at the active site.
So far, using SALSA, we have made predictions for 20 SG proteins in the HAD superfamily: one dehalogenase, eight sugar phosphatases, three NagD‐like phosphatases, four P‐Type ATPases, and four soluble epoxide hydrolases. These predictions will be experimentally validated by biochemical assays to establish the function of each protein and to verify our computational approach to protein function prediction. The ability to predict computationally the biochemical function of protein structures of unknown or uncertain function adds tremendous value to genomics data.
Support or Funding Information
This project is funded by NSF CHE‐1305655, CHE‐1905214, and CHE‐1757078.