Scene parsing is a key step in developing vision-based autonomous driving. Real-world images are too expensive to annotate at scale, whereas few-shot cross-domain scene parsing (CSP) approaches require only a few labeled target images to train a model alongside source virtual data, and have thus attracted growing attention in the community. However, since the target training images are too few to support statistically reliable cross-domain measures, directly following the spirit of conventional domain adaptation is inappropriate. In this paper, we reconsider this imbalanced transfer learning problem as a covariate balancing issue commonly studied in the Rubin causal framework. We first cast pixel-level domain adaptation in terms of the average treatment effect (ATE), where pixels are assigned to a treatment group or a control group according to their domain identity, which is taken as the treatment. In this view, the two domains are perfectly aligned when the ATE converges to zero. This motivates Counterfactual Balance Feature Alignment (CBFA), which mitigates the cross-domain imbalance across categories. CBFA revises existing adversarial adaptation techniques by modeling the propensity score of each pixel in its context, i.e., predicting which group it belongs to. The propensity score of a pixel is given by the output of the domain discriminator and is used to balance the adversarial adaptation objective. We evaluate our method on two suites of virtual-to-real scene parsing setups. Our method sets a new state of the art across 1- to 5-shot scenarios (in particular, 1-shot 56.79 on SYNTHIA-to-CITYSCAPES and 51.56 on GTA5-to-CITYSCAPES), supporting our motivation of connecting the ATE with the domain gap.
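To make the balancing idea concrete, below is a minimal PyTorch sketch of a propensity-weighted adversarial alignment loss. Reading the domain identity as the treatment (T = 1 for target pixels), the sigmoid output of a per-pixel domain discriminator plays the role of the propensity score e(x) = P(T = 1 | x), and each pixel's contribution to the adversarial objective is reweighted by its inverse propensity so the two groups are balanced. The function name, tensor shapes, and the specific inverse-propensity weighting are illustrative assumptions for this sketch, not the paper's exact CBFA objective.

```python
import torch
import torch.nn.functional as F

def propensity_balanced_adv_loss(feat_src, feat_tgt, discriminator, eps=1e-6):
    """Hypothetical sketch of a propensity-score-balanced adversarial loss.

    Assumes `discriminator` maps pixel features of shape (B, C, H, W) to one
    domain logit per pixel, shape (B, 1, H, W). Its sigmoid output is treated
    as the propensity score e(x) = P(domain = target | x).
    """
    # Propensity scores for source (control, T=0) and target (treated, T=1) pixels.
    p_src = torch.sigmoid(discriminator(feat_src))
    p_tgt = torch.sigmoid(discriminator(feat_tgt))

    # Inverse-propensity weights, as in covariate balancing: 1/(1 - e(x)) for
    # controls, 1/e(x) for treated. Detached so the weights themselves do not
    # receive gradients from the alignment objective.
    w_src = (1.0 / (1.0 - p_src + eps)).detach()
    w_tgt = (1.0 / (p_tgt + eps)).detach()

    # Standard per-pixel adversarial alignment loss, reweighted per pixel so
    # that over- and under-represented regions contribute in a balanced way.
    loss_src = F.binary_cross_entropy(p_src, torch.zeros_like(p_src), weight=w_src)
    loss_tgt = F.binary_cross_entropy(p_tgt, torch.ones_like(p_tgt), weight=w_tgt)
    return loss_src + loss_tgt
```

Under this weighting, both groups are pushed toward a common covariate distribution, which is the sense in which a vanishing ATE corresponds to a closed domain gap.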