Sign language recognition (SLR) helps hearing-impaired people communicate better with hearing individuals. The complementary information carried by multiple modalities can be exploited to improve SLR; however, existing multimodal fusion methods do not model the interrelationships between modalities in depth. This paper proposes SeeSign, a multimodal fusion framework for SLR based on statistical attention and contrastive attention. The two attention mechanisms are designed to investigate the intra-modal and inter-modal correlations of surface electromyography (sEMG) and inertial measurement unit (IMU) signals and to fuse the two modalities. Statistical attention uses the Laplace operator and a lower-quantile threshold to select and enhance active features within each modal feature clip. Contrastive attention computes the information gain of the active features in a pair of enhanced feature clips located at the same position in the two modalities, and fuses the enhanced clips at each position according to that gain. The fused multimodal features are fed into a Transformer-based network trained with connectionist temporal classification and cross-entropy losses for SLR. Experimental results show that SeeSign achieves an accuracy of 93.17% on isolated words, and word error rates of 18.34% and 22.08% on one-handed and two-handed sign language datasets, respectively. Moreover, it outperforms state-of-the-art methods in terms of accuracy and robustness.
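To make the described fusion pipeline concrete, the sketch below illustrates one plausible reading of the two attention steps. It is not the authors' implementation: the Laplace operator is approximated by a 1-D second-difference filter along time, the enhancement factor, the variance-based proxy for "information gain", and the function names (`statistical_attention`, `contrastive_attention`) are all assumptions introduced here for illustration.

```python
import numpy as np

def statistical_attention(clip, q=0.25):
    """Enhance 'active' features within one modality's feature clip.

    Assumption-based sketch: the Laplace operator is approximated by a
    second-difference filter over time, and features whose response exceeds
    the lower quantile q are treated as active and scaled up.
    """
    # clip: (T, D) feature clip for one modality (e.g., sEMG or IMU)
    lap = np.abs(np.diff(clip, n=2, axis=0, prepend=clip[:1], append=clip[-1:]))
    threshold = np.quantile(lap, q)            # lower-quantile threshold
    active = lap > threshold                   # mask of active features
    return np.where(active, clip * 2.0, clip)  # enhance active features only

def contrastive_attention(emg_clip, imu_clip, eps=1e-8):
    """Fuse a pair of enhanced clips located at the same temporal position.

    Assumption-based sketch: 'information gain' is approximated by each
    modality's share of feature-wise variance, used as the fusion weight.
    """
    gain_emg = emg_clip.var(axis=0) + eps
    gain_imu = imu_clip.var(axis=0) + eps
    w = gain_emg / (gain_emg + gain_imu)       # per-feature gain ratio
    return w * emg_clip + (1.0 - w) * imu_clip # position-wise weighted fusion

# Toy usage: 16-frame clips with 8-dimensional features per modality.
rng = np.random.default_rng(0)
emg = statistical_attention(rng.standard_normal((16, 8)))
imu = statistical_attention(rng.standard_normal((16, 8)))
fused = contrastive_attention(emg, imu)        # (16, 8) fused features
```

In this reading, the fused clips would then be concatenated along the time axis and passed to the Transformer-based recognition network trained with CTC and cross-entropy losses.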