This paper addresses three problems in training audio models: difficult convergence, large data requirements, and the high dimensionality of the generated audio feature vectors. Because audio data shifted into the image domain do not align well with natural image data, we propose quaternion Gabor filtering to suppress the background information of the spectrogram and reduce interference in the data. In addition, window lengths and frame shifts at multiple scales are used to capture the relationships among different sounding objects. To reduce the dimensionality of the generated feature vectors, a deep hashing module maps high-dimensional features to low-dimensional ones, and a probability function makes the learned samples more consistent with the overall distribution. In the experimental evaluation, the proposed method is tested on an environmental sound classification dataset and a music genre classification dataset; using only a common backbone network, it improves audio recognition accuracy.
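To illustrate the filtering step, the sketch below applies a plain real-valued 2-D Gabor kernel to a toy "spectrogram" (random background plus one strong horizontal band standing in for a sounding object). This is only a simplified stand-in: the paper's quaternion Gabor filter has four components, while this sketch keeps just the real part, and all kernel parameters (`size`, `sigma`, `theta`, `lam`) are illustrative assumptions.

```python
import numpy as np

def gabor_kernel(size=9, sigma=2.0, theta=0.0, lam=4.0):
    """Real part of a 2-D Gabor kernel: a Gaussian envelope times a cosine.
    (The quaternion variant in the paper extends this to four components.)"""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)   # rotate coordinates by theta
    return np.exp(-(x**2 + y**2) / (2 * sigma**2)) * np.cos(2 * np.pi * xr / lam)

def filter2d(image, kernel):
    """Naive 'valid' 2-D correlation, enough for a demonstration."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# Toy spectrogram: weak random background plus one strong horizontal band.
rng = np.random.default_rng(0)
spec = 0.1 * rng.random((64, 64))
spec[30:34, :] += 1.0

# theta = pi/2 makes the kernel oscillate vertically, so it responds strongly
# to horizontal structures (the band) and weakly to the background noise.
filtered = filter2d(spec, gabor_kernel(theta=np.pi / 2))
```

Rows of `filtered` covering the band respond far more strongly than rows covering only background, which is the suppression effect the abstract refers to.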
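The multi-scale analysis can be sketched by computing magnitude spectrograms of the same signal at two window/hop settings: short windows resolve fast temporal changes, long windows resolve closely spaced frequencies. The scales, sample rate, and the two-tone test signal below are illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def stft_magnitude(signal, win_len, hop):
    """Magnitude spectrogram via a Hann-windowed short-time Fourier transform."""
    window = np.hanning(win_len)
    n_frames = 1 + (len(signal) - win_len) // hop
    frames = np.stack([signal[i * hop : i * hop + win_len] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, win_len // 2 + 1)

# A 1-second toy signal: two tones standing in for different sounding objects.
sr = 16000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t) + 0.5 * np.sin(2 * np.pi * 1760 * t)

# Two scales: fine time resolution (256/128) and fine frequency resolution
# (1024/512); a model can then fuse features from both views.
scales = [(256, 128), (1024, 512)]
specs = [stft_magnitude(signal, w, h) for w, h in scales]
```

Each entry of `specs` is one view of the same audio; capturing connections between sounding objects amounts to combining these complementary time-frequency resolutions.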
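The deep hashing step can be sketched as a projection from high-dimensional backbone features to short binary codes, with a sigmoid relaxation during training and thresholding at inference. The feature dimension (512), code length (48), and random weights below are placeholders for the paper's learned hashing layer.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical backbone output: 512-D feature vectors for a batch of 8 clips.
features = rng.standard_normal((8, 512))

# Hashing layer sketched as one linear projection; in a real deep hash module
# these weights are trained end-to-end, here they are random for illustration.
W = rng.standard_normal((512, 48)) / np.sqrt(512)

logits = features @ W
probs = 1.0 / (1.0 + np.exp(-logits))   # differentiable relaxation at train time
codes = (probs > 0.5).astype(np.uint8)  # compact 48-bit codes at inference time

# Cheap Hamming distance on compact codes replaces costly comparison of the
# original high-dimensional feature vectors.
hamming = int(np.count_nonzero(codes[0] != codes[1]))
```

The storage saving is the point: each clip is reduced from 512 floats to 48 bits, while distances between codes still approximate similarity in feature space once the projection is trained.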