A robust, speaker‐independent algorithm for the automatic segmentation of speech has been designed. It aligns a phonetic transcription with the output of a phoneme nucleus detector based on the temporal decomposition (TD) paradigm [B. Atal, IEEE Trans. Acoust. Speech Signal Process. ASSP‐26, 81–84 (1983); Bailly, Marteau, and Abry, Proc. Int. Conf. ASSP, Glasgow, Scotland, 508–511 (1989)]: the phonetic string is modeled as a sequence of overlapping emergence functions (EFs) whose maxima occur at the phoneme nuclei. The segmenter minimizes the least‐squares reconstruction error between the time‐frequency representation of the speech signal and this model. The algorithm proceeds in three steps: (a) predetection of candidate phoneme nucleus centers, (b) time alignment of these candidates with the phonetic transcription, and (c) adjustment of the resulting nucleus centers and detection of phoneme boundaries. The first step, inspired by Van Hemert's work [Van Hemert, Philips Tech. Rev. 43(9), 233–242 (1987)], uses an adaptive detection window to produce candidate nucleus centers. The second step uses dynamic time warping (DTW) to align these candidates with the known phonetic transcription. The DTW is guided by anchor points: a crude local probability function takes into account the energy and zero‐crossing distributions of each phoneme. A new temporal decomposition technique gives an analytical solution with a fixed number of targets and no compactness constraints. The TD errors between three consecutive candidates are used to calculate transition costs, thus allowing nucleus centers to be inserted or omitted. The third step moves each nucleus center to the center of gravity of its corresponding EF. It also produces a phoneme boundary segmentation, placing each boundary at the time at which two adjacent EFs are equal.
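For orientation, the least‐squares criterion mentioned above can be written in the notation commonly used for temporal decomposition. The symbols are assumptions introduced here, not taken from the abstract: $\mathbf{y}(n)$ is the time‐frequency frame at time index $n$, $\mathbf{a}_k$ is the target vector of the $k$-th phoneme, $\phi_k(n)$ is its emergence function, $K$ is the number of targets, and $N$ is the number of frames.

```latex
\hat{\mathbf{y}}(n) = \sum_{k=1}^{K} \mathbf{a}_k \, \phi_k(n),
\qquad
E = \sum_{n=1}^{N} \Bigl\lVert \mathbf{y}(n) - \sum_{k=1}^{K} \mathbf{a}_k \, \phi_k(n) \Bigr\rVert^{2}
```

Under this reading, the segmenter seeks the targets and emergence functions that make $E$ as small as possible.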
The algorithm was trained on 200 sentences spoken by one speaker and tested on 50 sentences spoken by seven speakers. On the test corpus, 86% of the candidate nucleus centers are the sole candidate falling within a manual segment. In addition, 94% of the final nucleus centers match the manual segmentation.
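The third step described above (centers of gravity and boundaries at the crossing of adjacent EFs) might be sketched as follows. This is a minimal illustration only, not the authors' implementation: the function names, the array layout (one row per EF), the linear interpolation between frames, and the synthetic Gaussian EFs are all assumptions made for the example.

```python
import numpy as np

def centers_of_gravity(efs, frame_times):
    """Center of gravity (in seconds) of each emergence function.

    efs: array of shape (K, T), one non-negative EF per row (assumed layout).
    frame_times: array of shape (T,), the time of each analysis frame.
    """
    return (efs * frame_times).sum(axis=1) / efs.sum(axis=1)

def ef_boundaries(efs, frame_times):
    """Boundary between adjacent EFs: the time at which the two are equal.

    The crossing is searched between the two EF maxima and refined by
    linear interpolation between frames (an assumption of this sketch).
    Returns NaN for a pair whose EFs never cross in that interval.
    """
    bounds = []
    for k in range(efs.shape[0] - 1):
        diff = efs[k] - efs[k + 1]
        start, stop = int(np.argmax(efs[k])), int(np.argmax(efs[k + 1]))
        boundary = np.nan
        for t in range(start, stop):
            # Look for the frame pair where the left EF stops dominating.
            if diff[t] >= 0 and diff[t + 1] < 0:
                frac = diff[t] / (diff[t] - diff[t + 1])
                step = frame_times[t + 1] - frame_times[t]
                boundary = frame_times[t] + frac * step
                break
        bounds.append(boundary)
    return bounds

# Synthetic example: two Gaussian-shaped EFs, 5 ms frame step (hypothetical data).
frame_times = np.arange(61) * 0.005
efs = np.stack([
    np.exp(-0.5 * ((frame_times - 0.10) / 0.03) ** 2),
    np.exp(-0.5 * ((frame_times - 0.20) / 0.03) ** 2),
])
centers = centers_of_gravity(efs, frame_times)
bounds = ef_boundaries(efs, frame_times)
print(centers, bounds)
```

With two EFs of equal width, the boundary lands midway between the two centers. With unequal widths or amplitudes it shifts toward the weaker EF.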