28. Working with AI through speech


A few weeks ago I was curious to test what interacting with AI through speech might feel like. I’d been using ChatGPT, Claude, and other AI products for some time, but wondered if verbal dialogue might add anything to the interaction.

The tl;dr is that there is something magical about using your voice to command a machine and hearing a human-like voice in response. However, the realities of working with a voice-first product today make the experience suboptimal for most use cases. To borrow a film reference (I recently rewatched Her), I can’t say that we are close to communicating with Samantha, but the path there seems clearer than I thought. [1]

Building an AI voice proof of concept

I kept this proof of concept very light and wrote everything in Python. I created a script called local_chat.py that, when run, initiates a dialogue with an assistant that speaks its responses. I called the assistant Graham. [2]

Graham works by:

  • Using the SpeechRecognition library locally to listen to the computer’s audio inputs.
  • Sending the resulting text to OpenAI’s gpt-3.5-turbo model and parsing the response.
  • Vocalizing the response with ElevenLabs’ text-to-speech API.
Each tool had alternatives, but I decided to optimize for response and speech latency. For speech-to-text, SpeechRecognition won out against alternatives like Whisper. GPT-4 may have delivered better responses, but GPT-3.5’s much faster response time was a better trade-off. For text-to-speech, I first used my Mac’s say command (gTTS was another option), but the computer-garbled effect made the experience far less magical. In this case I traded some speed for experience with ElevenLabs’ (still quick) API. [3]
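For the curious, the loop can be sketched roughly as below. This is an illustrative reconstruction, not the actual local_chat.py: the third-party packages (speech_recognition, openai, elevenlabs) are imported lazily inside the functions that use them so the pure parts stand alone, and the exact ElevenLabs SDK call may differ by version.

```python
def build_messages(history, user_text):
    """Append the user's utterance to the running chat history."""
    return history + [{"role": "user", "content": user_text}]

def listen():
    # SpeechRecognition: capture from the default microphone and
    # transcribe with its bundled Google recognizer.
    import speech_recognition as sr
    r = sr.Recognizer()
    with sr.Microphone() as source:
        r.adjust_for_ambient_noise(source)
        audio = r.listen(source)
    return r.recognize_google(audio)

def respond(messages):
    # gpt-3.5-turbo chosen for latency, as discussed above
    # (old-style openai SDK call from the GPT-3.5 era).
    import openai
    resp = openai.ChatCompletion.create(model="gpt-3.5-turbo",
                                        messages=messages)
    return resp["choices"][0]["message"]["content"]

def speak(text):
    # ElevenLabs text-to-speech; treat this call as illustrative.
    from elevenlabs import generate, play
    play(generate(text=text))

def chat_loop():
    history = [{"role": "system", "content": "You are Graham."}]
    while True:
        history = build_messages(history, listen())
        reply = respond(history)
        history = history + [{"role": "assistant", "content": reply}]
        speak(reply)

if __name__ == "__main__":
    chat_loop()
```

The loop blocks on each stage in turn, which is exactly where the latency discussed below comes from.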

I also explored what it might feel like to speak with this model over the phone. For this I wired the above script to a Twilio phone number. I’d call the number or initiate a call with the call.py script and interact with the AI as above.
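A minimal sketch of the phone side might look like the Flask webhook below, assuming the Twilio number points at it. The chat_reply placeholder stands in for the GPT-3.5 round trip; the TwiML verbs come from Twilio's documented Python helper, and flask/twilio are imported lazily so the sketch reads standalone.

```python
def chat_reply(text):
    # Placeholder for the GPT-3.5 round trip; echoes for illustration.
    return "You said: " + text

def create_app():
    from flask import Flask, request
    from twilio.twiml.voice_response import VoiceResponse, Gather

    app = Flask(__name__)

    @app.route("/voice", methods=["POST"])
    def voice():
        # Greet the caller and collect their speech as text.
        resp = VoiceResponse()
        gather = Gather(input="speech", action="/reply", method="POST")
        gather.say("Hello, this is Graham.")
        resp.append(gather)
        return str(resp)

    @app.route("/reply", methods=["POST"])
    def reply():
        # Twilio posts its transcription in the SpeechResult field.
        heard = request.form.get("SpeechResult", "")
        resp = VoiceResponse()
        resp.say(chat_reply(heard))
        # Hand control back to /voice to keep the conversation going.
        resp.redirect("/voice")
        return str(resp)

    return app

if __name__ == "__main__":
    create_app().run(port=5000)
```

Note that Twilio does its own speech recognition here, which adds a second transcription layer on top of the phone channel's lossiness.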

The user experience

Engaging in a spoken conversation with an AI that responds in a lifelike voice feels special. Yet the experience today feels like both the future and the clunky present. A few things make it hard to work with right now.

First and foremost is latency. While it’s very cool to speak to Graham, one quickly gets frustrated with the seconds that pass between request and response. It feels like an eternity, more so than, say, waiting for a text response from GPT-4. Something about speech as an input method makes me impatient.

Beyond picking the fastest technologies mentioned above, I tried optimizing latency within the interaction itself. For instance, I found that processing requests and responses as single batches felt very slow, so I run speech-to-text in chunks as soon as my voice starts and run text-to-speech in chunks as GPT-3.5’s response streams in. This helps somewhat, but not entirely.
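The chunking idea can be sketched as a small generator that groups streamed model output into sentence-sized pieces so synthesis can begin before the full response has arrived (stream_gpt35 and speak below are hypothetical names for the pieces described above):

```python
def sentence_chunks(fragments):
    """Group streamed text fragments into sentence-sized chunks so
    text-to-speech can start before the full response has arrived."""
    buf = ""
    enders = ".!?"
    for fragment in fragments:
        buf += fragment
        while any(p in buf for p in enders):
            # cut at the earliest sentence-ending punctuation mark
            cut = min(buf.find(p) for p in enders if p in buf) + 1
            yield buf[:cut].strip()
            buf = buf[cut:]
    if buf.strip():
        yield buf.strip()

# Hypothetical wiring: speak each sentence as soon as it completes.
# for sentence in sentence_chunks(stream_gpt35(prompt)):
#     speak(sentence)  # e.g. ElevenLabs TTS
```

The trade-off is that time-to-first-sound drops to roughly one sentence's worth of tokens, while total playback time stays the same.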

Second, voice does not lend itself to information density. When I ask Graham knowledge questions, it reads back a stream of dense paragraphs that are difficult to consume. [4] Asking for help with booking a flight to New York, however, triggered a more “transactional” dialogue that felt more natural over voice and phone. [5]

Related to density, dialogue intended to understand or explore a topic is particularly ill-suited to voice. With audio, information is transmitted once; with text, the information is recorded and visually “callable” seconds, minutes, or hours later. Dense information can be revisited endlessly.

I mentioned I wired this up to Twilio to get a feel for interacting with AI over the phone. All the above considerations still apply here, but a few additional rough edges take away from the magic even more. For instance, I found speech recognition fails much more frequently over the phone, presumably due to lossiness in the phone channel. I also noticed additional latency.

In sum, voice is a very exciting potential path for human-AI interaction, but the technical stack does not yet feel ready to deliver a meaningfully magical experience for users. I’m excited to see it develop.

[1] I can’t say I love the movie, as I found the plot and pacing quite slow. However, it does an extraordinary job of exploring how humans and superintelligent services might interact in the future. What makes this exploration even more satisfying is that, beyond the thought put into the AI technology, its design, and the human-AI interaction, the film also raises many interesting questions about how society, relationships, and human thought and emotion change as a result.

[2] "Mr. Watson - come here - I want to see you."

[3] Another voice solution I played around with is Bark, which uses a generative audio model to produce incredibly lifelike voices and speech patterns. However, because the Bark model actually generates sound files, this approach would have introduced significant latency to the proof of concept. As a proof point, generating a short 25-word message with the Bark model took about 10 seconds on Colab’s GPUs.

[4] Research from MIT suggests we process visual information faster than audio. This seems logical. The eye can consume a paragraph of text and is rate-limited mainly by the speed of light. The ear can hear a paragraph of text and is rate-limited by the transmitter’s speed of speech.

[5] I didn’t give the model any action-taking capabilities so this was purely a test.