Failures of industrial assets can cause financial losses, operational disruptions, and safety hazards across many industries, making condition monitoring crucial for smooth operations. The colossal volume of sensory data generated and acquired throughout industrial operations supports real-time condition monitoring of these assets. Leveraging digital technologies to analyze this data creates an ideal environment for applying advanced data-driven machine learning techniques, such as convolutional neural networks (CNNs) and vision transformers (ViTs), to detect and classify faults, enabling accurate prediction and timely maintenance of industrial assets. In this paper, we present a novel hybrid framework that combines the local feature extraction ability of CNNs with the transformer's comprehensive understanding of global context. The proposed method leverages the weight-sharing properties of CNNs and the ability of transformers to capture spatial relationships in large-scale patterns, making it applicable to datasets of varying sizes. Preprocessing methods such as data augmentation are used to train the model on the Case Western Reserve University (CWRU) dataset in order to increase generalization while preserving computational efficiency. An average fault classification accuracy of 99.62% is achieved over all three fault classes, with an average time-to-fault detection of 38.4 ms. The MFPT fault dataset is used to further validate the method, with an accuracy of 99.17% for outer-race and 99.26% for inner-race faults. Moreover, the proposed framework can be modified to accommodate alternative convolutional models.
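The hybrid idea described above — convolutional layers for local, weight-shared feature extraction followed by attention for global context — can be illustrated with a minimal NumPy sketch. All array sizes, the single convolution/attention stage, and the three-class linear head below are illustrative assumptions, not the paper's actual architecture:

```python
import numpy as np

def conv1d_relu(x, kernels):
    """Local feature extraction with shared weights (CNN-style).
    x: (length, channels_in); kernels: (k, channels_in, channels_out)."""
    k = kernels.shape[0]
    out = np.stack([
        np.tensordot(x[i:i + k], kernels, axes=([0, 1], [0, 1]))
        for i in range(x.shape[0] - k + 1)
    ])
    return np.maximum(out, 0.0)  # ReLU

def self_attention(tokens):
    """Global context: scaled dot-product attention over all feature tokens."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

rng = np.random.default_rng(0)
signal = rng.standard_normal((64, 1))                     # raw vibration snippet
feats = conv1d_relu(signal, rng.standard_normal((5, 1, 8)))  # (60, 8) local features
ctx = self_attention(feats)                               # (60, 8) globally mixed features
logits = ctx.mean(axis=0) @ rng.standard_normal((8, 3))   # pooled head, 3 fault classes
pred = int(np.argmax(logits))                             # predicted fault class index
```

The convolution stage sees only a 5-sample window at a time, while the attention stage lets every feature token weigh every other token, which is the division of labor the framework relies on.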