This study examines participants’ vocal accommodation toward text-to-speech (TTS) voices produced by three devices that vary in the extent to which they embody a human form. Thirty-eight speakers shadowed words produced by a male and a female TTS voice presented across three physical forms: an Amazon Echo smart speaker (least human-like), a Nao robot (slightly more human-like), and a Furhat robot (most human-like). Ninety-six independent raters completed a separate AXB perceptual similarity assessment, which provides a holistic evaluation of accommodation. Results show convergence to the voices across all physical forms; convergence is stronger toward the female TTS voice when it is presented with the Echo smart speaker form, consistent with participants’ higher likability ratings and lower creepiness ratings for the Echo. We interpret our findings through the lens of communication accommodation theory (CAT), providing support for accounts of speech communication and for human–computer interaction frameworks.