ABSTRACT

Background
While many measures exist for assessing discourse in aphasia, manual transcription, editing, and scoring are prohibitively labor-intensive, a major obstacle to their widespread use by clinicians (Bryant et al. 2017; Cruice et al. 2020). Many tools also lack rigorous psychometric evidence of reliability and validity (Azios et al. 2022; Carragher et al. 2023). Establishing test reliability is the first step in our long-term goal of automating the Brief Assessment of Transactional Success in aphasia (BATS; Kurland et al. 2021) and making it accessible to clinicians and clinical researchers.

Aims
We evaluated multiple aspects of the test reliability of the BATS by examining correlations between human- and machine-edited transcripts, between transcripts edited by different human raters, and between raw and edited transcripts, along with interrater reliability of main concept scoring and test-retest performance. We hypothesized that automated methods of transcription and discourse analysis would demonstrate sufficient reliability to move forward with test development.

Methods & Procedures
We examined 576 story-retelling narratives from a sample of 24 persons with aphasia (PWA) and their familiar and unfamiliar conversation partners (CPs). PWA retold stories immediately after watching or listening to short video/audio clips. CPs retold stories after six-minute topic-constrained conversations with a PWA in which the dyad co-constructed the stories. We used two macrostructural measures to analyze the automated speech-to-text transcripts of the story retells: 1) a modified version of a semi-automated tool for measuring main concepts (mainConcept; Cavanaugh et al. 2021); and 2) an automated natural language processing "pipeline" to assess topic similarity.

Outcomes & Results
Correlations between raw and edited scores were excellent, and interrater reliability for transcript editing and main concept scoring was acceptable. Test-retest reliability on repeated stimuli was also acceptable; this was especially true of the PWA story retellings, in which stimuli were actually repeated within subjects.

Conclusions
Results suggest that automated speech-to-text transcription was sufficient in most cases to avoid the time-consuming, labor-intensive step of manually transcribing and editing discourse. Overall, our results suggest that automated natural language processing methods such as text vectorization and cosine similarity offer a fast, efficient way to obtain a measure of topic similarity between two discourse samples. Although test-retest reliability for the semi-automated mainConcept method was generally higher than for the automated topic-similarity measures, we found no evidence of a difference between machine-automated and human-reliant scoring.
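
To make the topic-similarity approach concrete, the sketch below illustrates text vectorization followed by cosine similarity on two short discourse samples. It is an illustrative sketch only: the abstract does not specify the BATS pipeline's vectorizer or preprocessing, so TF-IDF via scikit-learn is an assumption, and the topic_similarity function name and example transcripts are hypothetical.

    # Illustrative sketch of vectorization + cosine similarity for topic
    # similarity between two discourse samples. TF-IDF is an assumed
    # stand-in; the BATS pipeline's actual vectorizer is not specified here.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def topic_similarity(transcript_a: str, transcript_b: str) -> float:
        """Return a 0-1 topic-similarity score for two transcripts."""
        vectorizer = TfidfVectorizer(stop_words="english")  # assumed preprocessing
        tfidf = vectorizer.fit_transform([transcript_a, transcript_b])  # 2 x vocab
        return float(cosine_similarity(tfidf[0], tfidf[1])[0, 0])

    # Hypothetical usage: compare a source story to a retelling.
    source = "A man walks his dog in the park and meets an old friend."
    retell = "He was walking the dog at the park when he ran into a friend."
    print(f"topic similarity: {topic_similarity(source, retell):.2f}")

Because a score like this is computed directly from the transcript text, it can in principle be applied to raw, unedited speech-to-text output, which is what makes a fully automated measure of transactional success feasible.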