The circuitry and pathways in the brains of humans and other species have long inspired researchers and system designers to develop accurate and efficient systems capable of solving real-world problems and responding in real-time. We propose the Syllable-Specific Temporal Encoding (SSTE) to learn vocal sequences in a reservoir of Izhikevich neurons, by forming associations between exclusive input activities and their corresponding syllables in the sequence. Our model converts the audio signals to cochleograms using the CAR-FAC model to simulate a brain-like auditory learning and memorization process. The reservoir is trained using a hardware-friendly approach to FORCE learning. Reservoir computing could yield associative memory dynamics with far less computational complexity compared to RNNs. The SSTE-based learning enables competent accuracy and stable recall of spatiotemporal sequences with fewer reservoir inputs compared with existing encodings in the literature for similar purpose, offering resource savings. The encoding points to syllable onsets and allows recalling from a desired point in the sequence, making it particularly suitable for recalling subsets of long vocal sequences. The SSTE demonstrates the capability of learning new signals without forgetting previously memorized sequences and displays robustness against occasional noise, a characteristic of real-world scenarios. The components of this model are configured to improve resource consumption and computational intensity, addressing some of the cost-efficiency issues that might arise in future implementations aiming for compactness and real-time, low-power operation. Overall, this model proposes a brain-inspired pattern generation network for vocal sequences that can be extended with other bio-inspired computations to explore their potentials for brain-like auditory perception. Future designs could inspire from this model to implement embedded devices that learn vocal sequences and recall them as needed in real-time. Such systems could acquire language and speech, operate as artificial assistants, and transcribe text to speech, in the presence of natural noise and corruption on audio data.