Conventional point source detection methods generally work in a pixelwise manner and make little use of the overall semantic information of sources; consequently, they usually suffer from low precision. In this work, we perform point source detection in a fully patchwise manner by proposing SiamVIT, a siamese network built on a visual transformer (VIT). SiamVIT can effectively and accurately locate point sources in gamma-ray maps with high purity, not only in higher flux regions but also in lower flux regions, which is extremely challenging for state-of-the-art methods. SiamVIT consists of two VIT branches and a matching block. In the feature extraction stage, gamma-ray maps are fed into one VIT branch to obtain patch representations with adequate semantic and contextual information, whereas detection templates carrying location information are fed into the other branch to produce template representations. In the location stage, a patch representation and all template representations are fed into the matching block to determine whether the associated gamma-ray map patch contains a point source and, if so, where that point source is located. We compare SiamVIT with current advanced methods and find that it achieves significantly better purity and completeness and a superior Dice coefficient on the test set. In addition, when point sources overlap, SiamVIT distinguishes the individual sources better.
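For readers unfamiliar with the three evaluation quantities named above, the following minimal sketch shows how they are commonly computed from detection counts. It assumes the usual definitions (purity as the fraction of detections that are real sources, completeness as the fraction of real sources that are detected, and the Dice coefficient as their count-based harmonic combination); the exact definitions and the rule for matching a detection to a true source are not given here, so the function name `detection_metrics` and the example counts are hypothetical and purely illustrative.

```python
def detection_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute purity, completeness, and Dice from detection counts.

    Assumed (not taken from the paper) definitions:
      tp: detections matched to a true point source
      fp: detections with no matching true source
      fn: true sources with no matching detection
    """
    detected = tp + fp          # everything the detector reported
    actual = tp + fn            # everything that is really there
    overlap_total = 2 * tp + fp + fn

    # Guard against empty denominators so the function never divides by zero.
    purity = tp / detected if detected else 0.0
    completeness = tp / actual if actual else 0.0
    dice = 2 * tp / overlap_total if overlap_total else 0.0

    return {"purity": purity, "completeness": completeness, "dice": dice}


# Hypothetical counts, chosen only to illustrate the arithmetic.
metrics = detection_metrics(tp=80, fp=20, fn=20)
print(metrics)
```

Under these assumed definitions, the Dice coefficient equals the harmonic mean of purity and completeness, so it is high only when both are high; this is why it serves as a single summary figure alongside the two individual rates.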