ThermoScan: Semi-automatic Identification of Protein Stability Data From PubMed.

Paola Turina,Emidio Capriotti,Piero Fariselli

doi:10.3389/fmolb.2021.620475

Open Access

https://doi.org/10.3389/fmolb.2021.620475

Abstract

In recent years, the growing number of DNA sequencing and protein mutagenesis studies has generated a large amount of variation data published in the biomedical literature. Collecting such data has been essential for developing and assessing tools that predict the impact of protein variants at the functional and structural levels. Nevertheless, manually curating data from the literature is a highly time-consuming and costly process that requires domain experts. In particular, the development of methods for predicting the effect of amino acid variants on protein stability relies on thermodynamic data extracted from the literature. In the past, such data were deposited in the ProTherm database, which, however, has not been maintained since 2013. To facilitate the collection of protein thermodynamic data from the literature, we developed the semi-automatic tool ThermoScan. ThermoScan is a text-mining approach for identifying relevant thermodynamic data on protein stability in full-text articles. The method relies on a regular expression that searches for groups of words, including the most common conceptual words appearing in experimental studies on protein stability, several thermodynamic variables, and their units of measure. ThermoScan analyzes full-text articles from the PubMed Central Open Access subset and calculates an empirical score that allows the identification of manuscripts reporting thermodynamic data on protein stability. The method was optimized on a set of publications included in the ProTherm database and tested on a new curated set of articles, manually selected for the presence of thermodynamic data. The results show that ThermoScan returns accurate predictions and outperforms recently developed text-mining algorithms based on the analysis of publication abstracts.

Availability: The ThermoScan server is freely accessible online at https://folding.biofold.org/thermoscan. The ThermoScan Python code and the Google Chrome extension for submitting displayed PMC web pages to the ThermoScan server are available at https://github.com/biofold/ThermoScan.
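The regular-expression scoring described in the abstract can be illustrated with a minimal Python sketch. It is not the ThermoScan implementation: the term lists, the weights, the function names `score_passage` and `score_article`, and the choice to take the maximum over passages are all assumptions made for illustration. The actual patterns and scoring scheme are in the GitHub repository listed above.

```python
import re

# Hypothetical term groups for illustration only; NOT the patterns used by ThermoScan.
# Conceptual words common in protein stability experiments.
CONCEPT = re.compile(
    r"\b(?:unfolding|denaturation|thermostability|melting temperature)\b",
    re.IGNORECASE,
)
# Thermodynamic variables (case-sensitive, since "Tm" and "tm" differ in meaning).
VARIABLE = re.compile(r"(?:ΔΔG|ΔG|ΔH|\bTm\b)")
# A numeric value followed by a unit of measure.
VALUE_WITH_UNIT = re.compile(r"-?\d+(?:\.\d+)?\s*(?:kcal/mol|kJ/mol|°C)")

# Illustrative weights (assumed): a value with a unit counts double.
WEIGHTS = {"concept": 1.0, "variable": 1.0, "value": 2.0}


def score_passage(text: str) -> float:
    """Weighted count of concept, variable, and value-with-unit matches."""
    return (
        WEIGHTS["concept"] * len(CONCEPT.findall(text))
        + WEIGHTS["variable"] * len(VARIABLE.findall(text))
        + WEIGHTS["value"] * len(VALUE_WITH_UNIT.findall(text))
    )


def score_article(passages: list[str]) -> float:
    """Score an article by its best-scoring passage (an assumed aggregation)."""
    return max((score_passage(p) for p in passages), default=0.0)


if __name__ == "__main__":
    relevant = (
        "The melting temperature (Tm) decreased by 4.5 °C and "
        "ΔΔG was 1.2 kcal/mol upon unfolding."
    )
    unrelated = "We sequenced the genome of the strain."
    print(score_article([unrelated, relevant]))
```

An article would then be flagged as a candidate for curation when its score exceeds a threshold tuned on labeled articles. The threshold is left out here because the source does not state one.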

Highlights

A key aspect of characterizing the relationship between genotype and phenotype is the study of the impact of amino acid variants on protein function and structure (Thusberg and Vihinen, 2009; Compiani and Capriotti, 2013).
We present the results achieved by ThermoScan in the selection of manuscripts reporting experimental protein thermodynamic data from PubMed.
The method based on the maximum score achieved 3% higher accuracy (Q2) and 5% higher Matthews correlation coefficient (MCC).
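For reference, the two metrics named in the last highlight can be computed from a binary confusion matrix as in the sketch below. It assumes that Q2 denotes overall two-class accuracy. The function names and the example counts are hypothetical and are not results from the paper.

```python
import math


def q2(tp: int, tn: int, fp: int, fn: int) -> float:
    """Overall two-class accuracy: fraction of correctly classified articles."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; returns 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


if __name__ == "__main__":
    # Made-up counts for illustration only.
    tp, tn, fp, fn = 40, 45, 5, 10
    print(f"Q2  = {q2(tp, tn, fp, fn):.3f}")
    print(f"MCC = {mcc(tp, tn, fp, fn):.3f}")
```

MCC is often reported alongside accuracy because it remains informative when the two classes are unbalanced.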

Summary

INTRODUCTION

A key aspect of characterizing the relationship between genotype and phenotype is the study of the impact of amino acid variants on protein function and structure (Thusberg and Vihinen, 2009; Compiani and Capriotti, 2013). To address this task, several tools for predicting the effect of variants on protein stability have been developed (Sanavia et al., 2020). Text-mining tools are used in day-to-day life science research to improve web search (Ananiadou et al., 2010) and to facilitate the database curation process (Yeh et al., 2003; Wei et al., 2012; Karp, 2016). In this context, we developed ThermoScan, a new method for facilitating the collection and curation of thermodynamic data. We evaluated the performance of ThermoScan in detecting thermodynamic data in comparison with two existing web-server tools for document classification (Fontaine et al., 2009; Simon et al., 2019).

METHODS

Method optimization and Testing

RESULTS

Method

DATA AVAILABILITY STATEMENT

DISCUSSION

Journal: Frontiers in Molecular Biosciences	Publication Date: Mar 25, 2021
Citations: 6	License type: CC BY 4.0
