The trademark similarity analysis problem originates in the legal domain, specifically in the protection of intellectual property. One possible technical solution is a trademark similarity evaluation pipeline based on content-based image retrieval. Off-the-shelf CNN features have proven to be a good baseline for trademark retrieval. In recent years, however, computer vision has been shifting from CNNs to a newer architecture, the Vision Transformer (ViT). In this paper, we investigate the performance of off-the-shelf features extracted with vision transformers and explore the effects of pre-processing, post-processing, and pre-training on large datasets. We propose enhancing the trademark similarity evaluation pipeline through the joint use of global and local features, which leverages the strengths of both approaches. Experimental results on the METU Trademark Dataset show that off-the-shelf features from ViT-based models outperform those from CNN-based models. The proposed method achieves a mAP of 31.23, surpassing previous state-of-the-art results. We expect that an enhanced trademark similarity evaluation pipeline can improve the protection of intellectual property with the help of artificial intelligence methods. Moreover, this approach makes it possible to identify cases of unfair trademark use and to build an evidence base for litigation.
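To make the retrieval setting and the reported metric concrete, the following is a minimal sketch of how a gallery could be ranked by cosine similarity over global feature vectors and scored with mean average precision (mAP). It is an illustration under stated assumptions, not the pipeline evaluated in this paper: the function names are hypothetical, the feature vectors are toy values rather than ViT outputs, local-feature re-ranking is omitted, and the AP formula shown is one common definition that may differ in detail from the METU evaluation protocol.

```python
from math import sqrt


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def rank_gallery(query, gallery):
    """Return gallery indices sorted from most to least similar to the query."""
    scores = [(cosine_similarity(query, vec), idx) for idx, vec in enumerate(gallery)]
    # Sort by descending similarity; break ties by index for determinism.
    scores.sort(key=lambda s: (-s[0], s[1]))
    return [idx for _, idx in scores]


def average_precision(ranking, relevant):
    """Average precision of one ranked list, given the set of relevant indices."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(ranking, start=1):
        if idx in relevant:
            hits += 1
            precision_sum += hits / rank  # precision at this recall point
    return precision_sum / len(relevant)


def mean_average_precision(queries, gallery, relevant_sets):
    """Mean of the per-query average precision values."""
    aps = [
        average_precision(rank_gallery(q, gallery), rel)
        for q, rel in zip(queries, relevant_sets)
    ]
    return sum(aps) / len(aps)


# Toy example (assumed values): three gallery "trademarks" and one query.
gallery = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
query = [1.0, 0.1]
ranking = rank_gallery(query, gallery)
ap = average_precision(ranking, {0, 1})
```

In this toy example the relevant items land at ranks 1 and 3, so the average precision is (1/1 + 2/3) / 2. The sketch returns a value in [0, 1]; results such as the 31.23 above are conventionally reported as percentages.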