Objective. Despite the effective application of deep learning (DL) in brain–computer interface (BCI) systems, the successful execution of this technique, especially for inter-subject classification, in cognitive BCI has not been accomplished yet. In this paper, we propose a framework based on the deep convolutional neural network (CNN) to detect the attentive mental state from single-channel raw electroencephalography (EEG) data. Approach. We develop an end-to-end deep CNN to decode the attentional information from an EEG time series. We also explore the consequences of input representations on the performance of deep CNN by feeding three different EEG representations into the network. To ensure the practical application of the proposed framework and avoid time-consuming re-training, we perform inter-subject transfer learning techniques as a classification strategy. Eventually, to interpret the learned attentional patterns, we visualize and analyse the network perception of the attention and non-attention classes. Main results. The average classification accuracy is 79.26%, with only 15.83% of 120 subjects having an accuracy below 70% (a generally accepted threshold for BCI). This is while with the inter-subject approach, it is literally difficult to output high classification accuracy. This end-to-end classification framework surpasses conventional classification methods for attention detection. The visualization results demonstrate that the learned patterns from the raw data are meaningful. Significance. This framework significantly improves attention detection accuracy with inter-subject classification. Moreover, this study sheds light on the research on end-to-end learning; the proposed network is capable of learning from raw data with the least amount of pre-processing, which in turn eliminates the extensive computational load of time-consuming data preparation and feature extraction.