30. Pending modality shifts in AI


The energy around AI is palpable right now, in May 2023. Across Twitter, Hacker News, and even publications like the New York Times, many are sharing exciting visions for the future. It’s a thrilling time to work in technology.

Still, a few product- and design-level modality shifts are pending, and each could unlock a lot of value. Here are three.

From order-taker to context-gatherer: Models today are like overeager interns, pouncing on a user’s first utterance without asking follow-up questions. The result is overshooting some asks, missing the point of others, or, in the case of autonomous agents, getting bogged down in ineffectual loops that drift toward nonsense.

This is only a problem in “unbounded” use cases, where an AI-user interface lacks structure or input validation, as in ChatGPT or Poe. [1] Better prompting can help clarify asks and requirements, but this isn’t something I often see happen in products.

From active to ambient input: Right now most AI-first products are very user-driven. Notion AI requires a user to highlight text, summon a tooltip, and issue a command to rephrase the prose. ChatGPT requires a user to navigate to chat.openai.com, type a prompt, and hit send. Spotify’s AI DJ requires a user to navigate to the appropriate menu and commence a session. I’d call each of these an “active input” for using AI.

There is much to explore in more passive methods for using AI. GitHub Copilot is a great example of what I might call ambient input: an AI product that lives within an existing workflow and does not distract from that workflow. Microsoft recently announced Windows Copilot, which brings a chat interface to the Windows OS. This still feels like an active input, but the path toward an always-on background agent is clear. In many ways, this system-level visibility and permission set feels like the ultimate goal for a powerful on-machine AI.

From interpreting requests to inferring intent: New technical capabilities drive new user experiences. As we interface with more powerful software that has a deeper set of capabilities, we need a way to explore and steer those capabilities to do precisely what we want. The first explorations into AI-driven software have been dominated by the “chat” paradigm, which has enabled users to more easily dialogue and iterate with extremely general-purpose software. [2]

Interfaces will need to evolve to capture and simplify this ongoing iteration. Sam Schillace, a leader in Microsoft’s engineering division, explores this point in a great blog post [3]. I’ll summarize with a snippet:

I believe that most of what we will wind up doing eventually will be “talking” to the computer - where talking right now is mostly defined as “text chat” but will shortly be multimodal: images, gestures, voice, etc. As those interactions get more complex, it will be harder and harder to build “rigid” interfaces like we do now. Why build a bunch of static buttons that reflect an underlying, rigid schema somewhere, when you can let the user tell you what you want, find the result in a vector database, and iterate together?

Intent and Iteration will be a foundational metaphor for user experience in the next wave of software. We got click-and-drag windows interfaces when the tech was advanced enough to give us high resolution screens and fast processors that could handle the real-time interaction. We now have new capabilities that let us handle interaction with intent and meaning in real-time. Now it’s time to build the experiences that are native to those capabilities.

AI should get to a place where it does not require explicit, concrete requests from a user, as it might today, but instead understands personas and goals well enough to infer the user's objectives.

[1] Logan recently touched on this on Twitter: https://twitter.com/OfficialLoganK/status/1655937516795723777

[2] https://www.willsab.com/blog/20-on-prompting.html

[3] https://sundaylettersfromsam.substack.com/p/intent-and-iteration-are-the-new