RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation.

Wenjun Li, Marc Gwadz, Jiyao Wang, George Coulouris, Narmada Thanki, Azat Badretdin, Michael DiCuccio, Vyacheslav Chetvernin, Farideh Chitsaz, Myra K Derbyshire, Daniel H Haft, Roxanne A Yamashita, Noreen R Gonzales, A Scott Durkin, Jakyoung Song, Aron Marchler‐Bauer, Kathleen O’Neill, Mei‐Jie Yang, Christopher J Lanczycki, Zheng Chen, Françoise Thibaud‐Nissen

doi:10.1093/nar/gkaa1105

Abstract

The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.

Journal: Nucleic Acids Research
Publication Date: Dec 3, 2020
Citations: 650
License type: CC BY-NC
