This paper addresses automatic beamforming for the blind extraction of speech in a music environment using multiple microphones. A new criterion is proposed based on the variance of the spectral flux (VSF), which is shown to be a compound measure of the kurtosis and the across-time correlation of time-frequency domain signals. Spectral flux (SF) has been adopted as a feature that distinguishes speech from other acoustic noise, and the VSF of speech tends to be larger than that of other acoustic sounds. Hence, maximization of the VSF can be employed as a criterion for identifying the speech direction of arrival (DOA) and thereby extracting speech from noisy observations. We construct a VSF-inspired cost function and develop a complex-valued fixed-point algorithm for its optimization. The stability of the proposed algorithm is then analyzed using a second-order Taylor series expansion. Unlike subspace decomposition-based methods and approaches based on maximization of non-Gaussianity, which suffer from DOA identification ambiguity, the VSF-inspired criterion is shown in both real and simulated evaluations to effectively extract speech from a diffuse music noise field or a musical interference noise field. A key feature of the proposed approach is that it operates blindly, i.e., it requires no a priori knowledge of the array geometry, the noise covariance matrix, or the location of the desired speech source. This study therefore offers a potential approach to blindly extracting speech from a music environment.
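
To make the VSF feature concrete, the following minimal sketch computes a spectral flux sequence and its variance for a single-channel signal. It rests on assumptions that the abstract does not state: spectral flux is taken here as the squared frame-to-frame difference of the STFT magnitude, the magnitudes are normalized by their overall RMS, and the function names (`stft_magnitude`, `spectral_flux`, `variance_of_spectral_flux`) and the frame parameters are hypothetical. It illustrates the feature only; it is not the proposed cost function or the fixed-point beamforming algorithm.

```python
import numpy as np


def stft_magnitude(x, frame_len=256, hop=128):
    """Magnitude STFT with a Hann window (frame sizes are illustrative choices)."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack(
        [x[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    # One row per frame, one column per non-negative frequency bin.
    return np.abs(np.fft.rfft(frames, axis=1))


def spectral_flux(mag):
    """Assumed SF definition: squared change in magnitude between adjacent frames."""
    diff = np.diff(mag, axis=0)
    return np.sum(diff ** 2, axis=1)


def variance_of_spectral_flux(x, frame_len=256, hop=128):
    """Variance of the SF sequence after RMS normalization (an assumption)."""
    mag = stft_magnitude(x, frame_len, hop)
    mag = mag / (np.sqrt(np.mean(mag ** 2)) + 1e-12)  # remove overall level
    return np.var(spectral_flux(mag))


# Toy comparison: a steady tone versus the same tone gated on and off.
fs = 8000
t = np.arange(2 * fs) / fs
steady = np.sin(2 * np.pi * 440 * t)
gate = ((t * 4).astype(int) % 2 == 0).astype(float)  # toggles every 0.25 s
bursty = steady * gate

vsf_steady = variance_of_spectral_flux(steady)
vsf_bursty = variance_of_spectral_flux(bursty)
print(vsf_steady, vsf_bursty)
```

In this toy comparison the gated tone, whose spectrum changes abruptly at each onset and offset, yields a larger VSF than the steady tone. This is only a rough stand-in for the speech-versus-music contrast that the criterion exploits.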