Text recognition in the wild is a challenging task in computer vision and machine learning. Existing optical character recognition engines perform poorly on natural scene images. In this context, deep learning models have emerged as a powerful state-of-the-art technique for classification and recognition. This study proposes a new Convolutional Neural Network based system for scene text reading. We investigate how to combine a character recognition module with a subsequent word recognition module to achieve the overall system goal. The first module analyzes characters within multi-scale images by relying on the power of a convolutional network and a fully connected network for character recognition. The second module relies on a Viterbi search to find the closest word to a given character sequence. For greater precision, a bigram-based linguistic module is applied. The proposed system achieves state-of-the-art performance on three standard scene text recognition benchmarks: chars74k, ICDAR 2003 and ICDAR 2013. In particular, this performance is demonstrated in terms of both character and word recognition accuracy as well as speed, via a comparative study of different deep learning architectures.
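
To make the word recognition step more concrete, the following is a minimal sketch of a Viterbi search that combines per-position character scores with bigram transition probabilities. It is an illustration under stated assumptions, not the system's actual implementation: the function name `viterbi_decode`, the dictionary-based input format, and the `floor` smoothing value are all hypothetical, and the character scores are assumed to come from a softmax over the character classifier's outputs.

```python
import math


def viterbi_decode(emissions, bigram, floor=1e-6):
    """Return the most likely character string under a bigram model.

    emissions: list of dicts, one per character position, mapping each
               candidate character to its classifier score (assumed here
               to be a probability from the character recognizer).
    bigram:    dict mapping (previous_char, char) to a transition
               probability; unseen pairs fall back to `floor`.
    floor:     hypothetical smoothing value that avoids log(0).
    """
    if not emissions:
        return ""

    # Log-score of the best path ending in each character at position 0.
    best = {c: math.log(max(p, floor)) for c, p in emissions[0].items()}
    backpointers = []

    for scores in emissions[1:]:
        new_best, pointers = {}, {}
        for c, p in scores.items():
            # Choose the predecessor maximizing path score + transition.
            prev, value = max(
                ((q, best[q] + math.log(max(bigram.get((q, c), floor), floor)))
                 for q in best),
                key=lambda item: item[1],
            )
            new_best[c] = value + math.log(max(p, floor))
            pointers[c] = prev
        best = new_best
        backpointers.append(pointers)

    # Trace back from the best-scoring final character.
    path = [max(best, key=best.get)]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return "".join(reversed(path))
```

In this sketch, a character that the classifier slightly prefers can still be overruled when the bigram model makes a neighbouring alternative far more plausible, which is the role the abstract assigns to the linguistic module. A lexicon-constrained search would restrict the candidate paths further; that variant is not shown here.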