Federated learning (FL) has shown great potential for achieving distributed intelligence in privacy-sensitive IoT. However, popular FL approaches such as FedAvg and its variants share model parameters among clients during training and thus incur significant communication overhead in IoT. Moreover, non-independent and identically distributed (non-IID) data across learning devices severely degrade both the convergence and the speed of FL. To address these challenges, we propose Fed2KD, a communication-efficient federated learning framework based on two-step knowledge distillation, which boosts classification accuracy through privacy-preserving data generation while improving communication efficiency through a new knowledge distillation scheme empowered by an attention mechanism and metric learning. The generalization ability of Fed2KD is analyzed from the perspective of domain adaptation. Extensive simulation experiments are conducted on the Fashion-MNIST, CIFAR-10, and ImageNet datasets with various non-IID data distributions. The results show that Fed2KD reduces communication overhead and improves classification accuracy compared to FedAvg and its latest variants.
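To make the communication-overhead argument concrete, the following is a minimal sketch of the FedAvg server aggregation step the abstract refers to: each client uploads its full set of model parameters, and the server computes a weighted average by local dataset size. The function name `fedavg_aggregate` and the flat parameter representation are illustrative assumptions, not part of the paper.

```python
from typing import Dict, List

def fedavg_aggregate(client_params: List[Dict[str, List[float]]],
                     client_sizes: List[int]) -> Dict[str, List[float]]:
    """FedAvg server step: average client parameters, weighted by data size."""
    total = sum(client_sizes)
    agg: Dict[str, List[float]] = {}
    for name in client_params[0]:
        # Every client must transmit every entry of every tensor each round,
        # which is the source of the communication overhead discussed above.
        agg[name] = [
            sum(p[name][i] * n for p, n in zip(client_params, client_sizes)) / total
            for i in range(len(client_params[0][name]))
        ]
    return agg

# Two clients with one "layer" of two weights; client 1 holds twice the data.
clients = [{"w": [0.0, 2.0]}, {"w": [3.0, 5.0]}]
avg = fedavg_aggregate(clients, [2, 1])
# Weighted means: (0*2 + 3*1)/3 = 1.0 and (2*2 + 5*1)/3 = 3.0
```

Distillation-based schemes such as Fed2KD instead exchange compact knowledge (e.g., logits) rather than full parameter vectors, which is what enables the communication savings claimed above.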