Conventional diagnosis of septic arthritis relies on detecting the causal pathogens in samples of synovial fluid, synovium, or blood. However, isolating these pathogens by culture takes several days, which delays both diagnosis and treatment. A quantitative classification model based on ultrasound images is therefore needed for rapid diagnosis of septic arthritis. For this study, a database of 342 non-septic arthritis images and 168 septic arthritis images, acquired with grayscale (GS) and power Doppler (PD) ultrasound, was constructed. In the proposed fusion with attention and selective transformation (FAST) architecture, the GS and PD images were combined in a vision transformer (ViT) with the convolutional block attention module, which incorporates spatial, modality, and channel features. Fivefold cross-validation was applied to evaluate the generalization ability. The FAST architecture achieved an accuracy, sensitivity, specificity, and area under the curve (AUC) of 86.33%, 80.66%, 90.25%, and 0.92, respectively. This performance was higher than that of a conventional ViT (82.14%) and significantly better than that obtained with either modality alone (GS 73.88%, PD 72.02%; p < 0.01). By integrating multiple modalities and extracting multiple channel features, the established model achieved promising accuracy and AUC in septic arthritis classification. End-to-end learning of ultrasound features can provide rapid and objective assessment suggestions for future clinical use.
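
The fivefold cross-validation and the reported metrics (accuracy, sensitivity, specificity, AUC) can be illustrated with a minimal sketch. It is not the study's implementation: the function names `stratified_kfold`, `binary_metrics`, and `auc` are hypothetical, and stratifying the folds by class is an assumption, since the text states only that fivefold cross-validation was applied. The class counts (342 non-septic, 168 septic) are taken from the text. No classifier is included.

```python
import random


def stratified_kfold(labels, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs with roughly equal class ratios per fold.

    Stratification is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        # Deal the shuffled indices of this class round-robin into the folds.
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test


def binary_metrics(y_true, y_pred):
    """Accuracy, sensitivity, and specificity, with label 1 = septic arthritis."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
    }


def auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Class counts from the text: 342 non-septic (0) and 168 septic (1) images.
labels = [0] * 342 + [1] * 168
fold_sizes = [len(test) for _, test in stratified_kfold(labels)]
```

In use, a model would be trained on each `train` index set and scored on the matching `test` set, and the per-fold outputs of `binary_metrics` and `auc` would then be averaged over the five folds.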