This paper presents a speech-based system for estimating autism severity, combined with automatic speaker diarization. Speaker diarization was performed by two methods: the first used acoustic features, namely Mel-Frequency Cepstral Coefficients (MFCC) and pitch, and the second used x-vectors, embeddings extracted from Deep Neural Networks (DNN). Both diarization methods were trained using a Fully Connected Deep Neural Network (FCDNN). We then trained a Convolutional Neural Network (CNN) to estimate autism severity from 48 acoustic and prosodic speech features. One hundred thirty-two young children were recorded with a distant microphone in the Autism Diagnostic Observation Schedule (ADOS) examination room. Of the two diarization methods, MFCC and pitch achieved the better Diarization Error Rate (DER), 26.91%. Using this diarization method, the severity estimation system achieved a Pearson correlation of 0.606 between the predicted and actual autism severity scores (i.e., ADOS scores). Clinical Relevance: The presented system identifies children's speech segments and estimates their autism severity score.
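The abstract reports diarization performance as a Diarization Error Rate (DER). As a point of orientation, a minimal sketch of the standard NIST-style DER decomposition (false-alarm, missed-speech, and speaker-confusion time over total scored speech time) is shown below; the function name and example durations are illustrative, not taken from the paper.

```python
def diarization_error_rate(false_alarm, missed, confusion, total_speech):
    """Return DER as a fraction of total scored speech time.

    false_alarm  -- seconds labeled speech where there is none
    missed       -- seconds of speech the system did not detect
    confusion    -- seconds attributed to the wrong speaker
    total_speech -- total seconds of scored reference speech
    """
    return (false_alarm + missed + confusion) / total_speech


# Hypothetical durations: 10 s false alarm, 8 s missed, 9 s confused
# out of 100 s of reference speech.
der = diarization_error_rate(10.0, 8.0, 9.0, 100.0)
print(f"DER = {der:.2%}")  # → DER = 27.00%
```

A lower DER indicates that more of the recording was assigned to the correct speaker, which matters here because the severity estimator is applied only to segments diarized as the child's speech.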