Abstract

Transformer can effectively model long rang dependency, but suffer from uncapable to extract local feature patterns. While CNNs exploit local features effectively. In this paper, we seek to combine convolution and Transformers improves over using them individually, and propose improved features using convolution-augmented transformers for keyword spotting. The convolution-augmented transformers are constructed with a ResNet front-end and a convolution-augmented transformers back-end in series. Using this improved feature for keyword spotting task. The results show that the improved features using convolution- augmented transformers can yield at least 3% improvement compared with other features.

Full Text
Published version (Free)

Talk to us

Join us for a 30 min session where you can share your feedback and ask us any queries you have

Schedule a call