Protein Sequence Data Research Articles

In the present paper, a deep convolutional neural network (DCNN) is applied to multilocus protein subcellular localization because it is more suitable for multi-class classification. This application poses two main problems. First, appropriate features that capture the correlation between multiple sites are hard to find. Second, the classifier structure is difficult to determine because it is strongly affected by the distribution of the classified data. To solve these problems, a self-evoluting framework using DCNNs for multilocus protein subcellular localization is proposed. It has three characteristics that previous algorithms lack. First, it combines the ant colony algorithm with the DCNN to form a self-evoluting algorithm for multilocus protein subcellular localization. Second, it randomly groups subcellular sites using a limited random k-labelsets (RAkEL) multi-label classification method, which solves a complex problem by divide and conquer and yields a flexible, expandable model. Third, it selects feature extraction methods at random during localization, which avoids the defects of any individual feature extraction method. The algorithm is tested on the human database, where the overall accuracy is 67.17%, higher than that of a stacked autoencoder (SAE), support vector machine (SVM), random forest (RF) classifier, or single DCNN.

Graphical abstract: The algorithm consists of four parts: protein sequence data preprocessing, integrated DCNN model construction, finding the optimal DCNN combination by ant colony optimization, and protein subcellular localization of sequences. The parts run sequentially, and the data produced by each part is the basis for the next.

In data preprocessing, the limited RAkEL multi-label classification method is used to randomly group subcellular sites. At the same time, protein sequence features are fused using multiple feature extraction methods. Each combination of features and site information corresponds to one DCNN model. The ant colony optimization step uses the global optimization ability of the ant colony algorithm to find the best combination of DCNN models. Finally, the optimal model combination is used to obtain the multilocus subcellular localization of each sequence.
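The random grouping of subcellular sites can be illustrated with a minimal sketch. The abstract does not say what "limited" means for its RAkEL variant, so the code below assumes the simplest disjoint form, in which the label set is shuffled and cut into groups of at most k labels. The function name `random_k_labelsets`, the site names, and the value of k are all hypothetical and are not taken from the paper.

```python
import random


def random_k_labelsets(labels, k, seed=0):
    """Split labels into disjoint random groups of at most k labels each.

    Assumption: this is the disjoint flavour of random k-labelsets; the
    paper's "limited" variant may differ.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    rng = random.Random(seed)      # local generator, so the result is reproducible
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return [shuffled[i:i + k] for i in range(0, len(shuffled), k)]


# Hypothetical subcellular site names, for illustration only.
sites = ["nucleus", "cytoplasm", "mitochondrion", "membrane",
         "golgi", "er", "extracellular"]
groups = random_k_labelsets(sites, k=3, seed=42)

for index, group in enumerate(groups):
    print(index, group)
```

In a setup like the one described, each group would then be paired with a feature set and handed to its own classifier, so every model only has to separate a small subset of sites.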

The secondary structure prediction of proteins is a classic topic of computational structural biology with a variety of applications. During the past decade, the accuracy achieved by state-of-the-art algorithms has exceeded 80%; meanwhile, the time cost of prediction has increased rapidly because of the exponential growth of fundamental protein sequence data. Based on literature studies and preliminary observations on how the size and homology of the fundamental protein dataset relate to the speed and accuracy of predictions, we raised two hypotheses that might help determine the main factors influencing the efficiency of secondary structure prediction. Experimental results of size and homology reductions of the fundamental protein dataset supported those hypotheses. They revealed that shrinking the dataset could substantially cut the time cost of prediction at the price of a slight decrease in accuracy, whereas homology reduction of the dataset could increase accuracy. Moreover, Shannon information entropy could be applied to explain how accuracy was influenced by the size and homology of the dataset. Based on these findings, we proposed that a proper combination of size and homology reductions of the protein dataset could speed up secondary structure prediction while preserving the high accuracy of state-of-the-art algorithms. When the proposed strategy was tested with the 2018 fundamental protein dataset provided by the Universal Protein Resource, the speed of prediction increased more than 20-fold while all accuracy measures remained equivalently high. These findings should help improve the efficiency of research and applications that depend on protein secondary structure prediction. To make future implementations of the proposed strategy easy, we have established a database of size- and homology-reduced protein datasets at http://10.life.nctu.edu.tw/UniRefNR.
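For readers unfamiliar with the measure, Shannon entropy can be computed from residue frequencies as H = -Σ p·log2(p). The abstract does not state how the authors applied entropy to their datasets, so the sketch below only shows the standard definition on pooled residue counts. The function name `shannon_entropy` and the toy sequences are assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(sequences):
    """Shannon entropy, in bits, of the residue distribution pooled over sequences."""
    counts = Counter(residue for seq in sequences for residue in seq)
    total = sum(counts.values())
    # An empty input has no residues; the sum over no terms is 0.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Toy sequences, not real proteins.
redundant = ["AAAA", "AAAA"]
diverse = ["AAAA", "CCCC"]

print(shannon_entropy(redundant))
print(shannon_entropy(diverse))
```

Two equally frequent residues give one bit of entropy, while a dataset made of a single repeated residue gives zero, which is the sense in which a redundant dataset carries less information per residue.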

Articles published on Protein Sequence Data

Mask blast with a new chemical logic of amino acids for improved protein function prediction.

Generating functional protein variants with variational autoencoders.

Deep Protein Subcellular Localization Predictor Enhanced with Transfer Learning of GO Annotation

An Investigation of Alternatives to Transform Protein Sequence Databases to a Columnar Index Schema

A Simple Protein Evolutionary Classification Method Based on the Mutual Relations Between Protein Sequences

Sequence representation approaches for sequence-based protein prediction tasks that use deep learning.

Homology modeling and molecular docking simulation of some novel imidazo[1,2-a]pyridine-3-carboxamide (IPA) series as inhibitors of Mycobacterium tuberculosis

Predicting bacteriophage hosts based on sequences of annotated receptor-binding proteins

Distinguishing Enzymes and Non-enzymes Based on Structural Information with an Alignment Free Approach

Performance of Regression Models as a Function of Experiment Noise.

DGraph Clusters Flaviviruses and β-Coronaviruses According to Their Hosts, Disease Type, and Human Cell Receptors.

A computational approach for predicting drug–target interactions from protein sequence and drug substructure fingerprint information

A novel positive single-stranded RNA virus from the crustacean parasite, Probopyrinella latreuticola (Peracarida: Isopoda: Bopyridae)

HumDLoc: Human Protein Subcellular Localization Prediction Using Deep Neural Network.

Self-evoluting framework of deep convolutional neural network for multilocus protein subcellular localization.

Protein Sequence Selection Method That Enables Full Consensus Design of Artificial l-Threonine 3-Dehydrogenases with Unique Enzymatic Properties.

Assessment of different screening methods for selecting palaeontological bone samples for peptide sequencing

Engineering a Histone Reader Protein by Combining Directed Evolution, Sequencing, and Neural Network Based Ordinal Regression.

A simple strategy to enhance the speed of protein secondary structure prediction without sacrificing accuracy.

Homology Modeling of Distant Lipocalin Homologs Using a Structure-based Fingerprint as a Constraint for Sequence Alignment
