Abstract

BackgroundBiological research is generating high volumes of data distributed across various sources. The inconsistent naming of proteins and their encoding genes brings great challenges to protein data integration: proteins and their coding genes usually have multiple related names and notations, which are difficult to match absolutely; the nomenclature of genes and proteins is complex and varies from species to species; some less studied species have no nomenclature of genes and proteins; The annotation of the same protein/gene varies greatly in different databases. In summary, a comprehensive set of protein/gene synonyms is necessary for relevant studies.ResultsIn this study, we propose an approach for protein and its encoding gene synonym integration based on protein ontology. The workflow of protein and gene synonym integration is composed of three modules: data acquisition, entity and attribute alignment, attribute integration and deduplication. Finally, the integrated synonym set of proteins and their coding genes contains over 128.59 million terminologies covering 560,275 proteins/genes and 13,781 species. As the semantic basis, the comprehensive synonym set was used to develop a data platform to provide one-stop data retrieval without considering the diversity of protein nomenclature and species.ConclusionThe synonym set constructed here can serve as an important resource for biological named entity identification, text mining and information retrieval without name ambiguity, especially synonyms associated with well-defined species categories can help to study the evolutionary relationships between species at the molecular level. More importantly, the comprehensive synonyms set is the semantic basis for our subsequent studies on Protein–protein Interaction (PPI) knowledge graph.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call