Recent advances in the development of assistive devices have motivated researchers to use visual imagery (VI) mental tasks as a control paradigm for designing brain–computer interfaces that can produce a large number of control signals. Consequently, this can facilitate the design of control mechanisms that allow locked-in individuals to interact with the surrounding world. This paper presents a two-phase approach for decoding visually imagined digits and letters from electroencephalography (EEG) signals. The first phase employs the Choi–Williams time–frequency distribution (CWD) to construct a joint time, frequency, and spatial (TFS) representation of the EEG signals, which characterizes the variations in the energy encapsulated within the EEG signals across the TFS domains. The second phase presents a novel deep learning (DL) framework that automatically extracts features from the constructed joint TFS representation and decodes the imagined digits and letters. The performance of our approach is assessed using an EEG dataset acquired from 16 healthy participants while they imagined decimal digits and uppercase English letters. Our approach achieved a mean ± standard deviation accuracy of 95.47 ± 2.3%, which significantly exceeds the accuracies obtained when the CWD is replaced with two alternative time–frequency analysis techniques, the accuracies obtained using four pre-trained DL models, and the accuracies obtained using CWD-based handcrafted features classified with four conventional classifiers. Moreover, our approach outperforms several previous studies in terms of both accuracy and the number of decoded classes.
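The abstract does not give implementation details, but as a rough illustration of the first phase, the sketch below computes a discrete Choi–Williams distribution for each EEG channel and stacks the results into a channels × frequency × time tensor, i.e. a joint TFS representation. It is a minimal, unoptimised sketch assuming single-trial EEG stored as a channels × samples NumPy array; the kernel parameter `sigma`, the function names, and the toy trial are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def choi_williams_tfd(x, sigma=1.0):
    """Discrete Choi-Williams distribution (CWD) of a 1-D signal (sketch).

    Returns an (n x n) array of real energy values over frequency (rows)
    and time (columns). The exponential Choi-Williams kernel is applied as
    a lag-dependent time-smoothing window on the instantaneous
    autocorrelation before the FFT over the lag axis.
    """
    x = np.asarray(x, dtype=complex)
    n = len(x)
    r = np.zeros((n, n), dtype=complex)  # lag (rows) x time (columns)
    for t in range(n):
        r[0, t] = np.abs(x[t]) ** 2      # zero-lag term needs no smoothing
        for m in range(1, min(t, n - 1 - t, n // 2 - 1) + 1):
            # Time-smoothing window derived from the CW kernel
            # exp(-sigma * u^2 / (4 m^2)), truncated at ~3 standard
            # deviations and normalised to unit sum.
            spread = int(np.ceil(6 * m / np.sqrt(sigma)))
            u = np.arange(-spread, spread + 1)
            g = np.exp(-sigma * u ** 2 / (4.0 * m * m))
            g /= g.sum()
            # Smoothed instantaneous autocorrelation at time t and lag m.
            p, q = t + u + m, t + u - m
            valid = (p >= 0) & (p < n) & (q >= 0) & (q < n)
            r[m, t] = np.sum(g[valid] * x[p[valid]] * np.conj(x[q[valid]]))
            r[n - m, t] = np.conj(r[m, t])  # Hermitian symmetry in lag
    return np.real(np.fft.fft(r, axis=0))

def joint_tfs_representation(eeg, sigma=1.0):
    """Stack per-channel CWDs into a (channels x freq x time) tensor.

    `eeg` is assumed to be a (channels x samples) array for one trial; the
    channel axis carries the spatial dimension of the joint TFS
    representation.
    """
    return np.stack([choi_williams_tfd(ch, sigma) for ch in eeg])

# Toy usage: a 4-channel trial of 64 samples.
trial = np.random.randn(4, 64)
tfs = joint_tfs_representation(trial)
print(tfs.shape)  # (4, 64, 64)
```

Presumably, a tensor of this kind would then be resized or cropped as needed and passed to the DL framework of the second phase for feature extraction and classification.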