It remains challenging for robots to accurately perform sound source localization and speech recognition in a real environment with reverberation, noise, and the voices of multiple speakers. Accordingly, we propose “U-TasNet-Beam,” a speech extraction method for extracting only the target speaker’s voice from all ambient sounds in a real environment. U-TasNet-Beam is a neural beamformer comprising three elements: a neural network for removing reverberation and noise, a second neural network for separating the voices of multiple speakers, and a minimum variance distortionless response (MVDR) beamformer. Experiments with simulated data and recorded data show that the proposed U-TasNet-Beam can improve the accuracy of sound source localization and speech recognition in robots compared to the conventional methods in a noisy, reverberant, and multi-speaker environment. In addition, we propose the spatial correlation matrix loss (SCM loss) as a loss function for the neural network learning the spatial information of the sound. By using the SCM loss, we can improve the speech extraction performance of the neural beamformer.