
The Potential of Voice Interfaces in AI

Voice is the interface unlock for a billion people using AI, and we’ll all feel a little like Tony Stark in Iron Man.

Hey! Hope you’re doing great. I’m still keeping the weekly cadence for the newsletter for the fifth week straight. Thanks for being a subscriber, and a warm welcome to new ones. PS! If you want to help me grow this newsletter and you have friends who would enjoy the content, there’s now a way for everyone to win. Hit the refer button below, and you might unlock some kick-ass rewards, like selecting the topic for one edition of In Transit.

 

Key takeaways:

  1. Voice is key for AI mass adoption.

  2. ChatGPT's voice update is game-changing.

  3. Voice reduces AI interaction friction.

  4. Overlaying voice on existing interfaces is next.

  5. Jarvis-like voice assistants are becoming real.

Have you watched Iron Man (1, 2, or 3, or maybe all of them)? If you have, then you’ll remember Jarvis. If not, I’ll set the scene:

Throughout the movie(s), Tony Stark, the protagonist, is regularly seen talking with his AI companion Jarvis, solving problems, discussing, and joking.

Jarvis is all-knowing, has access to all the data, and is super-smart (duh), and he (it?) sounds more or less like a human.

It’s a cool spiel, a little sci-fi. Futuristic.

And now, that future seems to have arrived. And last week, I got to feel a little like Tony Stark.

I’ll explain in a second, but the big epiphany I had was this:

Voice is the critical interface for mass adoption of generative AI tools.

Hello, My Name is GPT

The ChatGPT mobile app recently got a significant update: it can now listen, speak, and see. These features are currently in beta and not yet accessible to everyone.

I got access to it last week (yay).

The voice part works like this (there’s a rough code sketch of the same loop after the list):

  1. Select the voice you want for it

  2. Activate voice mode

  3. Start talking

  4. ChatGPT talks back

  5. Repeat

  6. A written transcript of the conversation is recorded
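ChatGPT’s own voice pipeline isn’t public, but the loop above maps onto three public OpenAI API calls: speech-to-text (Whisper), a chat completion, and text-to-speech with a chosen voice. Here’s a minimal sketch of that loop using the openai Python SDK. The model names, the "alloy" voice, and the file-based audio in/out are my assumptions for illustration, not how the app is actually built:

```python
# Minimal sketch of a ChatGPT-style voice loop, built from OpenAI's public
# APIs. Assumes the `openai` Python SDK and an OPENAI_API_KEY in the
# environment; audio capture/playback is stubbed out as files for brevity.
from openai import OpenAI

client = OpenAI()
history = []  # doubles as the written transcript the app keeps


def voice_turn(audio_path: str) -> str:
    # 1. Transcribe the user's speech (Whisper speech-to-text).
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)
    history.append({"role": "user", "content": transcript.text})

    # 2. Get ChatGPT's reply to the conversation so far.
    completion = client.chat.completions.create(model="gpt-4", messages=history)
    reply = completion.choices[0].message.content
    history.append({"role": "assistant", "content": reply})

    # 3. Speak the reply back (text-to-speech with the selected voice profile).
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
    speech.stream_to_file("reply.mp3")  # play this file, then repeat the loop
    return reply
```

Wrap that function in a loop with microphone capture and playback, and you have the “talk, listen, repeat” flow from the list, transcript included.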

My first try was a rudimentary conversation about nothing in particular.

Later that day, I was working on a spreadsheet with some growth models for DX. I got stuck, unlocked my phone, and put ChatGPT into voice mode. I quickly described what I was working on and where I was stuck.

ChatGPT started proposing solutions and things worth considering.

I continued working on my spreadsheet. Phone still unlocked, ChatGPT still there. Me conversing with an AI, reasoning and discussing, all while building out the models in my spreadsheets.

Before I knew it, I had spent 20 minutes conversing with an AI while working.

The experience left me with two things.

  1. Feeling a bit like Tony Stark.

  2. A conviction that voice interfaces are a key unlock for mass adoption of generative AI.

Why Voice is Important

Voice will be vital because it enables a few things for AI tools. First, it removes barriers. There’s much less friction when you can speak to the computer and it replies and performs actions, compared to the stop-and-start interaction of typing on a keyboard.

Second, it humanizes the AI in an exciting way that’s hard to explain until you’ve experienced getting lost in conversation with a computer.

Shuffling through five different voice profiles to find the one you’re most comfortable with adds to this.

Third, it’s giving us an early taste of what will surely be the next significant computing modality shift. We went from stationary computing to on-the-go computing. Ambient, always-on computing is next.

The ChatGPT interface isn’t fully ambient because you have to unlock your phone, open the app, and enable voice mode. But once you’re there, it’s a nice preview of what’s to come.

What’s Next

The current version of the GPT voice interface is limited. It doesn’t have context about what you’re doing, it can’t see your screen, and so on.

It works great for making up bedtime stories on the spot.

But using it as a writing assistant is cumbersome because it can’t see what you’re writing while you’re talking to it.

Imagine instead, in my earlier spreadsheet example, that ChatGPT could see my screen and understand what I was working on. In that context, I could work on the numbers alongside the AI, and it could make edits in the same sheet as we discussed them.

This duality, voice as an interface overlaid on top of an existing interface, is one of the things we’ll see next. And soon. Microsoft Copilot is well-positioned for this.

Beyond that, it’s interesting to imagine what voice-only interfaces of this quality (both voice quality and intelligence) will enable.

Soon enough, we’ll all be able to have our own Jarvis.

Thanks for reading another edition of In Transit. 

Please subscribe if you want to catch future posts early. 

You can follow me on LinkedIn for more frequent, bite-sized content. 

And if you have any feedback or an idea you want to discuss, you can reach me at m[at]in-transit.xyz.