The subcellular localization of mRNA is widely recognized as crucial for understanding its biological function. However, existing localization predictors that rely on k-mer frequency features may overlook the order information in a sequence, and a single encoding method may not adequately extract sequence features. This paper proposes CSpredR, a novel deep learning method designed specifically for predicting the subcellular localization of multi-site mRNAs. Unlike previous methods, CSpredR first tokenizes each mRNA sequence into k-mers and then converts the tokenized sequence into a de Bruijn graph, which captures the structural information within the sequence more precisely. To mitigate the loss of sequential information and better capture sequence features, we combine word2vec and fastText models to extract the features of each node in the graph while retaining the sequence order. Both models encode the k-mer units of a sequence as word vectors, which serve as the node feature vectors of the graph, so that each node is assigned a feature vector containing rich semantic information. We then apply multi-scale convolutional neural networks and bidirectional long short-term memory networks separately to capture sequence features, and fuse their outputs as input to a multi-head attention mechanism. The information from the attention heads is integrated into the node representations, and the attention-processed data are finally fed into a multi-layer perceptron (MLP) for prediction. Extensive experiments show that CSpredR achieves a 2% improvement over the best existing predictors, offering a more effective tool for mRNA subcellular localization prediction.
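As an illustration of the first two steps described above (k-mer tokenization followed by graph construction), the following is a minimal sketch and not the CSpredR implementation. It assumes overlapping k-mers with stride 1, treats each distinct k-mer as a graph node, and links consecutive k-mers with a directed edge weighted by how often the transition occurs. The function names `kmer_tokenize` and `build_de_bruijn`, the choice k = 3, and the toy sequence are all hypothetical.

```python
from collections import Counter


def kmer_tokenize(seq, k=3):
    """Split a sequence into overlapping k-mers (stride 1, an assumption)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_de_bruijn(tokens):
    """Build a simple de Bruijn-style graph from a k-mer token list.

    Nodes are the distinct k-mers; a directed edge (a, b) is counted
    each time k-mer b immediately follows k-mer a in the sequence.
    """
    nodes = sorted(set(tokens))
    edges = Counter(zip(tokens, tokens[1:]))
    return nodes, edges


# Toy example sequence (hypothetical, not taken from the paper's data).
tokens = kmer_tokenize("AUGGCAUGG", k=3)
nodes, edges = build_de_bruijn(tokens)

for (src, dst), weight in edges.items():
    print(f"{src} -> {dst} (weight {weight})")
```

In this sketch, repeated k-mers collapse into a single node, so the graph is more compact than the token list, while the edge weights preserve how often each transition occurs. In the pipeline described above, each node would then receive a feature vector from the word2vec and fastText embeddings of its k-mer.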