The ability to accurately encode the temporal information of sensory events, and hence to act promptly, is fundamental to human behavioral decision-making. Here we examined ensemble coding (averaging multiple intervals in a sound sequence) and the immediately subsequent reproduction of a target duration at half, equal to, or double the perceived mean interval in a sensorimotor loop. Using magnetoencephalography (MEG), we found that the contingent magnetic variation (CMV) over the central scalp varied as a function of the averaging task, with a faster amplitude buildup and shorter peak latencies in the “half” condition than in the “double” condition. The latency from event-related desynchronization (ERD) to event-related synchronization (ERS) was also shorter in the “half” condition. During time reproduction, beta-band (15–23 Hz) power showed a robust suppression and recovery between the final tone and the key press. The beta modulation depth (i.e., the ERD-to-ERS power difference) was larger in motor areas than in primary auditory areas. Moreover, the phase slope index (PSI) indicated that beta oscillations in the left supplementary motor area (SMA) led those in the right superior temporal gyrus (STG), showing SMA-to-STG directionality in the processing of sequential (temporal) auditory interval information. Our findings provide the first evidence that CMV and beta oscillations predict the coupling between perception and action in time averaging.