Solar irradiance measurements are critical for a broad range of energy-system applications, including evaluating the performance ratio of photovoltaaic systems and forecasting power generation. Estimating solar irradiance from sky images offers a low-cost, low-maintenance solution that integrates easily into Internet-of-Things networks with minimal data loss. This work demonstrates that a vision-transformer-based machine learning model can produce accurate irradiance estimates from sky images without any auxiliary data. The training data comprise 17 years of global horizontal, diffuse, and direct irradiance measurements from a high-precision, sun-tracked pyranometer and pyrheliometer system, in conjunction with sky images from a standard-lens camera and a fish-eye camera. The vision-transformer-based model learns to attend to relevant features of the sky images and produces highly accurate estimates of both global horizontal irradiance (RMSE = 52 W/m²) and diffuse irradiance (RMSE = 31 W/m²). This work compares the model's performance on wide-field-of-view all-sky images with its performance on images from a standard camera and shows that the vision transformer works best on all-sky images. For images from a standard camera, the vision-transformer and convolutional architectures perform similarly, with the convolution-based architecture showing an advantage for direct irradiance (RMSE = 155 W/m²).
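
To make the general approach concrete, the sketch below shows a minimal vision-transformer regressor that maps a sky image to irradiance values. This is an illustrative sketch only: the patch size, depth, embedding dimension, image resolution, and two-output head (e.g., global horizontal and diffuse irradiance) are assumptions for demonstration, not the architecture or hyperparameters used in this work.

```python
# Minimal sketch of a vision-transformer irradiance regressor (PyTorch).
# All hyperparameters below are illustrative assumptions, not the
# configuration used in the paper.
import torch
import torch.nn as nn

class ViTIrradianceRegressor(nn.Module):
    def __init__(self, image_size=224, patch_size=16, dim=256,
                 depth=6, heads=8, n_outputs=2):  # e.g. [GHI, diffuse]
        super().__init__()
        n_patches = (image_size // patch_size) ** 2
        # Patch embedding: split the sky image into patches, project each.
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch_size,
                                     stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        # Regression head maps the class token to irradiance in W/m².
        self.head = nn.Linear(dim, n_outputs)

    def forward(self, x):                            # x: (B, 3, H, W)
        tokens = self.patch_embed(x)                 # (B, dim, H/p, W/p)
        tokens = tokens.flatten(2).transpose(1, 2)   # (B, n_patches, dim)
        cls = self.cls_token.expand(x.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed
        tokens = self.encoder(tokens)                # self-attention over patches
        return self.head(tokens[:, 0])               # (B, n_outputs)

model = ViTIrradianceRegressor()
pred = model(torch.randn(1, 3, 224, 224))            # e.g. [GHI, diffuse] estimate
```

A model of this kind would typically be trained with a mean-squared-error loss against the pyranometer and pyrheliometer measurements, which corresponds directly to the RMSE metric reported above.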