Deep neural networks (DNNs) are easily biased toward the training set, which causes substantial performance degradation on out-of-distribution data. Many methods have been proposed in the domain generalization (DG) literature to generalize under various distribution shifts. To facilitate practical DG research, we construct PaHCC (Printed and Handwritten Chinese Characters), a large-scale non-i.i.d. Chinese character dataset targeting a real application scenario for DG methods: generalization from printed fonts to handwritten characters (PF2HC). We evaluate eighteen DG methods on PaHCC and show that the performance of current algorithms on this dataset remains inadequate. To improve performance, we propose a radical-based multi-label learning method that integrates structure learning into statistical methods. Moreover, under dynamic evaluation settings, we uncover additional properties of DG methods and show that many algorithms exhibit unstable performance. We advocate that researchers in the DG community attend not only to accuracy under the fixed leave-one-domain-out protocol but also to algorithmic stability across varying training domains in future studies. Our dataset, method, and evaluations offer valuable insights to the DG community and could promote the development of realistic and stable algorithms.