Dataset from a human-in-the-loop approach to identify functionally important protein residues from literature

Melanie Vollmar, Santosh Tirunagari, Deborah Harrus, David Armstrong, Romana Gáborová, Deepti Gupta, Marcelo Querino Lima Afonso, Genevieve Evans, Sameer Velankar

doi:10.1038/s41597-024-03841-9

Abstract

We present a novel system that keeps curators in the loop to develop a dataset and a model for detecting structural features and functional annotations at the residue level in standard publication text. Our approach integrates data from multiple resources, including PDBe, EuropePMC, PubMedCentral, and PubMed, with annotation guidelines from UniProt, and uses LitSuggest and HuggingFace models as tools in the annotation process. A team of seven annotators manually curated ten articles for named entities, which we used to train a starting PubMedBERT model from HuggingFace. Using a human-in-the-loop annotation system, we iteratively developed the best model, which reached 0.90 for precision, 0.92 for recall, and 0.91 for F1-measure. The system shows that machine learning techniques and human expertise can be combined successfully to curate a dataset of residue-level functional annotations and protein structure features. The results demonstrate the potential for broader applications in protein research, bridging the gap between advanced machine learning models and the indispensable insights of domain experts.
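As a quick consistency check on the reported metrics, the F1-measure is conventionally the harmonic mean of precision and recall. The sketch below assumes that convention and applies it to the rounded values quoted in the abstract; the function name `f1_score` is illustrative and not taken from the paper, and the authors may have averaged their scores across entity types differently.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the usual F1 definition)."""
    if precision + recall == 0:
        # Avoid division by zero when both metrics are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the abstract (assumed inputs for illustration).
reported_precision = 0.90
reported_recall = 0.92

print(round(f1_score(reported_precision, reported_recall), 2))  # prints 0.91
```

Under this assumption, the three reported figures are mutually consistent.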

Journal: Scientific Data
Publication Date: Sep 27, 2024
License type: CC BY
