Promoters are DNA regulatory elements located proximal to the transcription start site that control the initiation of transcription of specific genes. In Escherichia coli, promoters are recognized by σ factors, which fall into several families according to their function and structure, such as σ24, σ28, σ32, σ38, σ54 and σ70. At present, these promoters are identified mainly by biological experiments. However, because such experiments are time-consuming and costly in materials, computational methods have emerged as a more efficient way to predict promoters and their types. In this study, we develop a novel two-layer seamless predictor called iPro2L-PSTKNC to identify promoters in the E. coli genome. It is based on a newly proposed feature extraction model, the position specific tendencies of k-mer nucleotide composition (PSTKNC). The first layer is a binary classifier that predicts whether a sequence is a promoter or not. The second layer is a multi-class classifier that identifies which type the predicted promoter belongs to. The ensemble SVM classifier performs best among the algorithms compared, achieving an accuracy and a Matthews correlation coefficient (MCC) of [Formula: see text] and [Formula: see text], respectively. Our data and code are available at https://github.com/lyuyinuo/iPro2L-PSTKNC.
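The two-layer design can be illustrated with a minimal Python sketch. The abstract does not define PSTKNC, so the feature below (the difference between the frequency of each k-mer at each position in promoter and non-promoter training sequences) is only an assumed stand-in, not the authors' formulation. The function names, the toy sequences, the sum-based threshold that replaces the ensemble SVM, and the fixed second-layer label are all hypothetical and serve only to show the control flow.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence


def positional_kmers(seq: str, k: int = 2) -> List[str]:
    """Return the k-mer that starts at each position of seq."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def position_tendencies(pos_seqs: Sequence[str], neg_seqs: Sequence[str],
                        k: int = 2) -> List[Dict[str, float]]:
    """For each position, map each k-mer to its frequency among the positive
    sequences minus its frequency among the negative sequences.

    Assumed stand-in for PSTKNC; all sequences must have the same length.
    """
    n_positions = len(pos_seqs[0]) - k + 1
    table = []
    for i in range(n_positions):
        pos_counts = Counter(positional_kmers(s, k)[i] for s in pos_seqs)
        neg_counts = Counter(positional_kmers(s, k)[i] for s in neg_seqs)
        table.append({
            kmer: pos_counts[kmer] / len(pos_seqs) - neg_counts[kmer] / len(neg_seqs)
            for kmer in set(pos_counts) | set(neg_counts)
        })
    return table


def encode(seq: str, table: List[Dict[str, float]], k: int = 2) -> List[float]:
    """Encode a sequence as one tendency score per position (0.0 if unseen)."""
    return [table[i].get(kmer, 0.0)
            for i, kmer in enumerate(positional_kmers(seq, k))]


def two_layer_predict(seq: str,
                      is_promoter: Callable[[str], bool],
                      sigma_type: Callable[[str], str]) -> str:
    """Layer 1 decides promoter vs. non-promoter; layer 2 runs only on
    sequences accepted by layer 1 and assigns a promoter type."""
    if not is_promoter(seq):
        return "non-promoter"
    return sigma_type(seq)


# Toy data, invented for illustration only.
promoters = ["TATAAT", "TATAAT", "TATACT"]
non_promoters = ["GCGCGC", "GGGCCC"]
table = position_tendencies(promoters, non_promoters, k=2)


def layer_one(seq: str) -> bool:
    # Hypothetical layer 1: a simple threshold on the summed scores,
    # standing in for the trained binary classifier.
    return sum(encode(seq, table)) > 0


def layer_two(seq: str) -> str:
    # Hypothetical layer 2: a placeholder that always returns one label,
    # standing in for the trained multi-class classifier.
    return "sigma70"


for s in ["TATAAT", "GCGCGC"]:
    print(s, two_layer_predict(s, layer_one, layer_two))
```

In a real pipeline, `layer_one` and `layer_two` would be replaced by trained classifiers operating on the encoded feature vectors; the cascade itself stays the same, so sequences rejected by the first layer never reach the second.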