Fixed-length Feature Vector Research Articles

Simple Summary: The family of coronaviruses comprises a diverse set of strains and variants which cause diseases ranging from the common cold to COVID-19. Moreover, they infect a wide array of hosts, from bats, camels, and birds to humans. Studying coronaviruses through the lens of host specificity provides a unique perspective on the evolution, diversity, and dynamics of this family. In particular, it can reveal groups of different hosts infected by similar strains, giving clues about which strains were more likely to have evolved to jump from one host to another. In this work, we frame host specificity as a classification task and design a very compact numerical representation of the spike sequences of different coronaviruses. Based on this numerical representation, classification methods are able to detect the target host with high accuracy. Such an approach can be used to scale efficiently to large volumes of sequences, in order to unveil trends in the host specificity of different coronavirus strains.
Abstract: The study of host specificity has important connections to the question of the origin of SARS-CoV-2 in humans, which led to the COVID-19 pandemic and remains an important open question. There are speculations that bats are a possible origin. Likewise, there are many closely related (corona)viruses, such as SARS, which was found to be transmitted through civets. The study of the different hosts which can be potential carriers and transmitters of deadly viruses to humans is crucial to understanding, mitigating, and preventing current and future pandemics. In coronaviruses, the surface (S) protein, or spike protein, is important in determining host specificity, since it is the point of contact between the virus and the host cell membrane. In this paper, we classify the hosts of over five thousand coronaviruses from their spike protein sequences, segregating them into clusters of distinct hosts among birds, bats, camels, swine, humans, and weasels, to name a few. We propose a feature embedding based on the well-known position weight matrix (PWM), which we call PWM2Vec, and we use it to generate feature vectors from the spike protein sequences of these coronaviruses. While our embedding is inspired by the success of PWMs in biological applications, such as determining protein function and identifying transcription factor binding sites, we are the first (to the best of our knowledge) to use PWMs from viral sequences to generate fixed-length feature vector representations and to use them in the context of host classification. The results on real-world data show that, when using PWM2Vec, machine learning classifiers perform comparably to the baseline models in terms of predictive performance and runtime, and in some cases perform better. We also measure the importance of different amino acids using information gain, to show which amino acids are important for predicting the host of a given coronavirus. Finally, we perform some statistical analyses on these results to show that our embedding is more compact than the embeddings of the baseline models.
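The abstract does not spell out how PWM2Vec turns a position weight matrix into a vector, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes that the k-mers of a sequence are used to build a PWM of log-odds scores against a uniform background, and that each k-mer is then scored by summing its per-position weights. The function names, the default `k=3`, the pseudocount, and the uniform background are all illustrative assumptions. Under these assumptions, sequences of equal (for example, aligned) length yield vectors of equal length.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def kmers(seq, k):
    """Return all overlapping k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def position_weight_matrix(kmer_list, k, pseudocount=1.0):
    """Build a PWM: one dict per k-mer position, holding log2 odds
    of each amino acid against a uniform background (an assumption)."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(kmer_list) + pseudocount * len(AMINO_ACIDS)
    pwm = []
    for pos in range(k):
        counts = Counter(km[pos] for km in kmer_list)
        pwm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        })
    return pwm


def pwm_embedding(seq, k=3):
    """Score every k-mer of seq against the PWM built from seq's own k-mers.

    Symbols outside the 20 standard amino acids contribute a weight of 0
    (an illustrative choice, not taken from the source).
    """
    kms = kmers(seq, k)
    pwm = position_weight_matrix(kms, k)
    return [sum(pwm[pos].get(aa, 0.0) for pos, aa in enumerate(km)) for km in kms]


# Hypothetical toy fragments of equal length, for illustration only.
vec_a = pwm_embedding("MFVFLVLLPLVSSQ", k=3)
vec_b = pwm_embedding("MKVLLALLPLASSQ", k=3)
```

Each vector has one score per k-mer position, so its length is the sequence length minus `k` plus one. Such vectors could then be passed to any standard classifier.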

Entity Resolution (ER) is defined as the process of identifying records/objects that correspond to the same real-world objects/entities. To define a good ER approach, the schema of the data should be well known. In addition, schema alignment of multiple datasets is not an easy task and may require either a domain expert or an ML algorithm to select which attributes to match. Schema-agnostic meta-blocking tries to solve this problem by considering each token as a blocking key regardless of the attributes it appears in. It may also be coupled with meta-blocking to reduce the number of false negatives. However, it requires an exact match of tokens, which rarely occurs in real datasets, and it results in very low precision. To overcome these issues, we propose a novel and efficient ER approach for big data, implemented in Apache Spark. The proposed approach avoids schema alignment, as it treats the attributes as a bag of words and generates a set of n-grams which is transformed into vectors. The generated vectors are compared using a chosen similarity measure. The proposed approach is generic, as it can accept all types of datasets. It consists of five consecutive sub-modules: 1) Dataset acquisition, 2) Dataset pre-processing, 3) Setting selection criteria, where all settings of the proposed approach are selected, such as the blocking key, the significant attributes, the NLP techniques, the ER threshold, and the ER scenario, 4) ER pipeline construction, and 5) Clustering, where similar records are grouped into the same cluster. The ER pipeline accepts two types of attributes: the Weighted Attributes (WA) or the Compound Attributes (CA). In addition, it accepts all the settings selected in the third module. The pipeline consists of five phases. Phase 1) Generating the tokens composing the attributes. Phase 2) Generating n-grams of length n. Phase 3) Applying hashing Term Frequency (TF) to convert the n-grams to a fixed-length feature vector. Phase 4) Applying Locality Sensitive Hashing (LSH), which maps similar input items to the same buckets with a higher probability than dissimilar input items. Phase 5) Classifying the objects as duplicates or not according to the calculated similarity between them.
We introduce seven different scenarios as input to the ER pipeline. To minimize the number of comparisons, we propose a length filter, which greatly improves the effectiveness of the proposed approach: it achieves the highest F-measure with the existing computational resources and scales well with the available working nodes. Three results have been revealed: 1) Using the CA in the different scenarios achieves better results than the single WA in terms of efficiency and effectiveness. 2) Scenarios 3 and 4 achieve the best performance time, because using Soundex and stemming reduces the running time of the proposed approach. 3) Scenario 7 achieves the highest F-measure because, with the length filter, we only compare records whose string lengths are within a pre-determined percentage of each other. LSH is used to map similar input items to the same buckets with a higher probability than dissimilar ones. It takes numHashTables as a parameter. Increasing the number of candidate pairs with the same numHashTables will reduce the accuracy of the model. Utilizing the length filter helps to minimize the number of candidates, which in turn increases the accuracy of the approach.
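The pipeline phases described above can be illustrated with a small, single-machine sketch. It is not the authors' Spark implementation. It builds character n-grams, maps them to a fixed-length vector with the hashing trick, applies a length filter, and compares the surviving pairs with cosine similarity. The LSH phase is deliberately left out, so every pair that passes the length filter is compared directly. All function names, the CRC32 hash, the vector size, and the thresholds are illustrative assumptions.

```python
import math
import zlib
from itertools import combinations


def char_ngrams(text, n=3):
    """Lowercase, collapse whitespace, and return overlapping character n-grams."""
    s = " ".join(text.lower().split())
    return [s[i:i + n] for i in range(max(len(s) - n + 1, 1))]


def hashing_tf(ngrams, num_features=1024):
    """Hashing trick: count n-grams into a fixed-length vector.

    zlib.crc32 is used because it is stable across runs, unlike the
    built-in hash() for strings.
    """
    vec = [0] * num_features
    for g in ngrams:
        vec[zlib.crc32(g.encode("utf-8")) % num_features] += 1
    return vec


def cosine(u, v):
    """Cosine similarity of two equal-length count vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def passes_length_filter(a, b, max_diff=0.2):
    """Keep a pair only if the string lengths differ by at most max_diff,
    measured relative to the longer string."""
    longer = max(len(a), len(b))
    return longer == 0 or abs(len(a) - len(b)) / longer <= max_diff


def find_duplicates(records, threshold=0.8, max_diff=0.2):
    """Return index pairs whose vectors are at least `threshold` similar."""
    vectors = [hashing_tf(char_ngrams(r)) for r in records]
    pairs = []
    for i, j in combinations(range(len(records)), 2):
        if not passes_length_filter(records[i], records[j], max_diff):
            continue  # skipped without computing the similarity
        if cosine(vectors[i], vectors[j]) >= threshold:
            pairs.append((i, j))
    return pairs


# Hypothetical records, for illustration only.
records = [
    "John Smith, 12 Main Street",
    "Jon Smith, 12 Main Street",
    "Acme Corporation Ltd",
]
duplicate_pairs = find_duplicates(records)
```

In this toy run the first two records are flagged as a duplicate pair, while the third is either removed by the length filter or falls below the similarity threshold. In a distributed setting, an LSH step would replace the exhaustive pair loop, so that only records sharing a bucket are compared.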

Articles published on Fixed-length Feature Vector

On the optimality of quantum circuit initial mapping using reinforcement learning

PseAAC2Vec protein encoding for TCR protein sequence classification

Structural Outlier Detection and Zernike-Canterakis Moments for Molecular Surface Meshes-Fast Implementation in Python.

Self-Organizing Neural Scheduler for the Flexible Job Shop Problem With Periodic Maintenance and Mandatory Outsourcing Constraints.

Embedding-Based Deep Neural Network and Convolutional Neural Network Graph Classifiers

Reads2Vec: Efficient Embedding of Raw High-Throughput Sequencing Reads Data.

TRAL: A Tag-Aware Recommendation Algorithm Based on Attention Learning

Deep learning in business analytics: A clash of expectations and reality

SummerTime: Variable-length Time Series Summarization with Application to Physical Activity Analysis

PredAoDP: Accurate identification of antioxidant proteins by fusing different descriptors based on evolutionary information with support vector machine

Estimating the degree of conflict in speech by employing Bag-of-Audio-Words and Fisher Vectors

Efficient analysis of COVID-19 clinical data using machine learning models.

PWM2Vec: An Efficient Embedding Approach for Viral Host Specification from Coronavirus Spike Sequences.

Scale-invariant histogram of oriented gradients: novel approach for pedestrian detection in multiresolution image dataset

Robust Representation and Efficient Feature Selection Allows for Effective Clustering of SARS-CoV-2 Variants

Statistical and Visual Analysis of Audio, Text, and Image Features for Multi-Modal Music Genre Recognition.

An Effective Entity Resolution Approach for Big Data

Robust and accurate prediction of protein–protein interactions by exploiting evolutionary information

Accurate Identification of Antioxidant Proteins Based on a Combination of Machine Learning Techniques and Hidden Markov Model Profiles.

IT4SE-EP: Accurate Identification of Bacterial Type IV Secreted Effectors by Exploring Evolutionary Features from Two PSI-BLAST Profiles.
