Speech segregation from a monaural recording is a primary task of auditory scene analysis, and it has proven very challenging. We present a multistage model for the task. The model starts with a simulated auditory periphery. A subsequent stage computes mid-level auditory representations, including correlograms and cross-channel correlations. The core of the system performs segmentation and grouping in a two-dimensional time-frequency representation that encodes proximity in frequency and time, periodicity, and amplitude modulation (AM). Motivated by psychoacoustic observations, our system employs different mechanisms for handling resolved and unresolved harmonics. For resolved harmonics, the system generates segments, the basic components of an auditory scene, based on temporal continuity and cross-channel correlation, and groups them according to periodicity. For unresolved harmonics, the system generates segments based on AM in addition to temporal continuity and groups them according to AM repetition rates, which we derive using sinusoidal modeling and gradient descent. Underlying the segregation process is a pitch contour that is first estimated from speech segregated according to global pitch and then adjusted according to psychoacoustic constraints. The model has been systematically evaluated, and it yields substantially better performance than previous systems.
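The mid-level representations named above can be made concrete with a short sketch. The Python fragment below (hypothetical helper names; the window length, lag range, and 16-kHz sampling rate are illustrative assumptions, not values taken from the paper) computes one correlogram frame as a per-channel running autocorrelation of the peripheral filter outputs, and then the correlation between the autocorrelation responses of adjacent channels, which indicates whether neighboring channels are responding to the same harmonic.

```python
import numpy as np

def correlogram(channels, start, win=320, max_lag=200):
    """One correlogram frame: running autocorrelation of each filter channel.

    channels : array (num_channels, num_samples) of auditory filter outputs
    start    : first sample of the analysis window
    Returns an array (num_channels, max_lag + 1) of autocorrelations,
    normalized so the zero-lag value in every channel is 1.
    """
    num_channels = channels.shape[0]
    acf = np.zeros((num_channels, max_lag + 1))
    for c in range(num_channels):
        seg = channels[c, start:start + win + max_lag]
        base = seg[:win]
        for lag in range(max_lag + 1):
            acf[c, lag] = np.dot(base, seg[lag:lag + win])
    acf /= np.maximum(acf[:, :1], 1e-12)   # guard against silent channels
    return acf

def cross_channel_correlation(acf):
    """Pearson correlation between autocorrelations of adjacent channels."""
    z = acf - acf.mean(axis=1, keepdims=True)
    z /= np.maximum(np.linalg.norm(z, axis=1, keepdims=True), 1e-12)
    return np.sum(z[:-1] * z[1:], axis=1)  # one value per adjacent pair

# Example: 64 channels of a 16-kHz signal; 20-ms window, lags up to 12.5 ms
# (enough to cover pitch periods down to 80 Hz). Random data stands in for
# actual gammatone filter outputs.
channels = np.random.randn(64, 16000)
acf = correlogram(channels, start=0)
cc = cross_channel_correlation(acf)
```

In a correlogram, channels dominated by the same periodic source share a common peak at the pitch period, and high cross-channel correlation marks time-frequency regions that the segmentation stage can merge into a single segment, consistent with the grouping cues described above.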