In computer-assisted pronunciation training (CAPT), the scarcity of large-scale non-native corpora and of human expert annotations poses two fundamental challenges to non-native acoustic modeling. Most existing approaches to acoustic modeling in CAPT rely on non-native corpora, yet with so many living languages in the world, it is impractical to collect and annotate non-native speech for every L1-L2 pair. In this work, we address non-native acoustic modeling, at both the phonetic and the articulatory level, through transfer learning. To train acoustic models of non-native speech effectively without such data, we propose to exploit two large native speech corpora, one in the learner's native language (L1) and one in the target language (L2), to model cross-lingual phenomena. This transfer learning provides a better feature representation of non-native speech. Experimental evaluations are carried out for Japanese speakers learning English. We first demonstrate that the proposed acoustic-phone model achieves a lower word error rate in non-native speech recognition. It also improves pronunciation error detection based on the goodness of pronunciation (GOP) score. For diagnosing pronunciation errors, the proposed acoustic-articulatory modeling method is effective in providing detailed feedback at the articulation level.
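For context, a widely used frame-normalized formulation of the GOP score (the abstract does not spell out the exact variant adopted here, so the following is the common definition rather than necessarily the one used in this work) scores a phone p against its force-aligned acoustic segment:

\[
\mathrm{GOP}(p) \;=\; \frac{1}{N_p}\,\log \frac{p(\mathbf{O}^{(p)} \mid p)}{\max_{q \in Q} p(\mathbf{O}^{(p)} \mid q)}
\]

where \(\mathbf{O}^{(p)}\) is the segment aligned to p, \(N_p\) is its number of frames, and \(Q\) is the phone inventory. A low GOP value indicates that some competing phone q explains the segment better than the canonical phone p, flagging a likely mispronunciation; better acoustic models sharpen these likelihoods and hence the detection.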