Protein Family Classification Research Articles

Data mining in most cases relies on, and is limited by, a biologist's experience and knowledge, and it demands sustained concentration and time. We have developed an original hierarchical protein family classification, based primarily on protein sequence motifs, functional domains and cellular localization. This protein classification schema has been generated by classifying entries in Interpro 1.2, an integrated resource of protein families, domains and sites [1]. By using this new classification, one can effectively and rapidly mine data on proteins involved in oncogenesis. This classification is especially tailored toward the biopharmaceutical industry and drug discovery efforts, and it has been extensively used in Hyseq's internal data-mining processes. The hierarchical protein classification has 9 main classes, 56 subclasses, and 3,052 Interpro entries. These Interpro entries represent 574 domains, 2,418 families, 46 repeats and 14 post-translational modification sites from clustering PRINTS, PROSITE, ProDom, SWISS-PROT, TrEMBL and Pfam data. We generate Pfam models from multiple protein sequence alignments and express them mathematically as hidden Markov models. To demonstrate the utility of our classification schema, we searched the SWISS-PROT and TrEMBL databases with Pfam models. Of the 31% of protein sequences that had significant Pfam hits, 4,520 sequences were in the cancer subclass. At present 59 Interpro entries exist in this subclass. These entries include breast cancer susceptibility proteins, retinoblastoma protein domains, p53 tumor antigen, Xeroderma pigmentosum proteins and the Burkitt's lymphoma receptor.
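The mining step described in this abstract, mapping search hits onto a class/subclass hierarchy and counting sequences per subclass, could be sketched roughly as follows. This is only an illustration: the class names, entry identifiers, sequence identifiers and the function name are hypothetical placeholders, not values from the published schema, and the sketch assumes each entry sits in exactly one subclass.

```python
from collections import Counter

# Hypothetical hierarchy: (main class, subclass) -> entry identifiers.
# None of these names or identifiers come from the published classification.
hierarchy = {
    ("class_A", "cancer"): {"ENTRY_1", "ENTRY_2"},
    ("class_A", "other_subclass"): {"ENTRY_3"},
    ("class_B", "another_subclass"): {"ENTRY_4"},
}

# Invert the hierarchy so each entry points at its (class, subclass) node.
entry_to_node = {
    entry: node for node, entries in hierarchy.items() for entry in entries
}

# Hypothetical search results: sequence id -> entries with significant hits.
hits = {
    "seq1": {"ENTRY_1", "ENTRY_3"},
    "seq2": {"ENTRY_2"},
    "seq3": {"ENTRY_4"},
    "seq4": set(),  # no significant hit
}


def sequences_per_subclass(hits, entry_to_node):
    """Count how many sequences have at least one hit in each subclass."""
    counts = Counter()
    for entries in hits.values():
        # Use a set so a sequence is counted once per subclass.
        nodes = {entry_to_node[e] for e in entries if e in entry_to_node}
        counts.update(nodes)
    return counts


counts = sequences_per_subclass(hits, entry_to_node)
print(counts[("class_A", "cancer")])
```

Keying the lookup by entry keeps the per-sequence work proportional to the number of hits, which matters when the hits come from a search over whole sequence databases.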

Protein sequence classification is becoming an increasingly important means of organizing the voluminous data produced by large-scale genome sequencing projects. At present, there are several independent classification methods. To aid the general classification effort, we have created a unified protein family resource, MetaFam. MetaFam is a protein family classification built upon 10 publicly accessible protein family databases (Blocks+, DOMO, Pfam, PIR-ALN, PRINTS, PROSITE, ProDom, PROTOMAP, SBASE, and SYSTERS). MetaFam's family 'supersets', as we call them, are created automatically using set theory to compare families among the databases. Families of one database are matched to those in another when the intersection of their members exceeds all other possible family pairings between the two databases. Pairwise family matches are drawn together transitively to create a new list of protein family supersets. MetaFam family supersets have several useful features: (1) each superset contains more members than the families from which it is composed, because each of the component family databases only works with a subset of our full non-redundant set of proteins; (2) conflicting assignments can be pinpointed quickly, since our analysis identifies individual members that are in conflict with the majority consensus; (3) family descriptions that are absent from automated databases can frequently be assigned; (4) statistics have been computed comparing domain boundaries, family size distributions, and overall quality of MetaFam supersets; (5) the supersets have been loaded into a relational database to allow for complex queries and visualization of the connections among families in a superset and the consensus of individual domain members; and (6) the quality of individual supersets has been assessed using numerous quantitative measures such as family consistency, connectedness, and size. We anticipate this new resource will be particularly useful to genomic database curators.
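The superset construction described in this abstract (best pairwise matches by member intersection, then transitive merging) can be sketched as below. This is one reading of the abstract, not MetaFam's actual implementation: it assumes a match requires the overlap to strictly exceed every other pairing that involves either family, it uses a union-find structure for the transitive step, and the database names, family names, protein identifiers and function names are all hypothetical.

```python
from itertools import combinations


def reciprocal_best_matches(fams_a, fams_b):
    """Pair families whose overlap beats every rival pairing of either one."""
    overlap = {
        (a, b): len(members_a & members_b)
        for a, members_a in fams_a.items()
        for b, members_b in fams_b.items()
    }
    matches = []
    for (a, b), size in overlap.items():
        if size == 0:
            continue
        # Rival pairings share exactly one of the two families with (a, b).
        rivals = [
            s for (a2, b2), s in overlap.items() if (a2 == a) != (b2 == b)
        ]
        if all(size > s for s in rivals):
            matches.append((a, b))
    return matches


def build_supersets(databases):
    """databases: {db_name: {family_name: set of protein ids}}."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    # Register every family so unmatched ones become singleton supersets.
    for db, fams in databases.items():
        for fam in fams:
            find((db, fam))

    # Merge matched families; union-find makes the merging transitive.
    for db1, db2 in combinations(databases, 2):
        for a, b in reciprocal_best_matches(databases[db1], databases[db2]):
            parent[find((db1, a))] = find((db2, b))

    groups = {}
    for node in list(parent):
        groups.setdefault(find(node), set()).add(node)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical toy input: three databases covering overlapping protein sets.
databases = {
    "dbA": {"A1": {"p1", "p2", "p3"}, "A2": {"p7", "p8"}},
    "dbB": {"B1": {"p2", "p3", "p4"}, "B2": {"p8", "p9"}},
    "dbC": {"C1": {"p4", "p5"}},
}
supersets = build_supersets(databases)

# Members of the first superset: the union over its component families.
first_members = set().union(*(databases[db][fam] for db, fam in supersets[0]))
print(supersets)
print(sorted(first_members))
```

In the toy input, A1 and C1 share no proteins, yet both match B1 and so end up in the same superset, whose pooled membership is larger than any single component family. That mirrors feature (1) in the abstract.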

Articles published on Protein Family Classification

PANTHER: a browsable database of gene products organized by biological function, using curated protein family and subfamily classification.

An efficient algorithm for large-scale detection of protein families

Classification of Transmembrane Protein Families Based on Topological Similarity

Structural and evolutionary relationships among protein tyrosine phosphatase domains.

Efficient data mining of proteins involved in carcinogenesis by functional classification using Interpro

MetaFam: a unified classification of protein families. II. Schema and query capabilities.

MetaFam: a unified classification of protein families. I. Overview and statistics.

Classification of Eukaryotic 7-tms Transmembrane Proteins by Binary Topology Pattern

Functional classification of cNMP-binding proteins and nucleotide cyclases with implications for novel regulatory pathways in Mycobacterium tuberculosis.

Classification and function prediction of proteins using diagnostic amino acid patterns.

Protein Family Classification Based on Searching a Database of Blocks

A unique method to represent proteins
