Assistants are all the rage right now. Everyone seems to be working on one. People imagine them becoming the new interface to computers — why bother with apps and searching the web when you can just ask your assistant to do it for you?
Yet, a major challenge stands between the dream assistant and the current reality. It’s called the multi-agent problem, and most companies are reluctant to talk about it. The solution will ultimately determine how much of an impact assistants will have.
How assistants work
Assistants have a dirty little secret: They don’t actually understand you. At least, not in the way you might think they do.
To make them easier to deploy and reuse, early developers of assistants designed them not as one large program but as many small ones, each specialized in completing a particular task: booking an appointment, calling a car, setting an alarm, and so on. They called these mini-programs agents. The assistant itself didn't have to know anything specific about these tasks; it merely interpreted the user's words and picked the best agent for the job.
Virtually every modern assistant on the market today has copied this approach. Some assistants, like Amazon's Alexa and Microsoft's Cortana, even allow outside developers to add new agents (Amazon calls them skills; Microsoft calls them bots).
There are some serious drawbacks to this approach.
Since all of the expert, task-specific knowledge is trapped inside the agents, the assistant itself is left with virtually no understanding of the meaning behind your words. Instead, it merely looks for patterns (keywords and phrases) and guesses. If it doesn't recognize some words, it ignores them, hoping that the parts it couldn't understand weren't that important.
This is why assistants are remarkably easy to trip up; they don’t actually understand the meaning of your words. You may have noticed that they’ll occasionally do radically different things when you alter your commands ever so slightly.
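To make that concrete, here is a minimal sketch of keyword-based dispatch in the spirit described above. The agent names and keyword lists are hypothetical, and real assistants use more sophisticated classifiers, but the failure mode is the same: change one word and a different agent wins.

```python
# Toy keyword dispatcher. Agent names and keyword lists are made up
# for illustration; real assistants use statistical intent classifiers.
AGENTS = {
    "restaurant_finder": {"brunch", "restaurant", "dinner", "table", "eat"},
    "car_caller": {"car", "ride", "taxi", "pickup"},
    "alarm_setter": {"alarm", "wake", "remind"},
}

def pick_agent(utterance: str) -> str:
    """Score each agent by counting keyword hits; unknown words are simply ignored."""
    words = set(utterance.lower().split())
    scores = {name: len(words & keywords) for name, keywords in AGENTS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "fallback"  # no match at all: pure guesswork

print(pick_agent("find me a good place for brunch"))       # -> restaurant_finder
print(pick_agent("find me a good place to park the car"))  # -> car_caller
```

Swapping a single word ("brunch" for "car") sends two very similar requests to completely different agents.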
It’s really difficult for today’s assistants to handle tasks that require activating more than one agent. For example, if you ask your assistant to help you find a good place for brunch and ask it to call you a car in the same sentence, it’s not clear which agent is best suited to handle the job — the restaurant-finder agent or the car-calling agent.
Things get harder still when a company opens its assistant to outside developers. Now the assistant has to distinguish between dozens of agents, each claiming to be the best at handling a particular task. If an assistant has Yelp, Foursquare, TripAdvisor, and Google Places agents, for example, how does it determine which one should help you find a place for that special date?
This is the crux of the multi-agent problem: how does an assistant with limited knowledge of the world, working from a set of isolated agents, many of which might claim to do the same thing, choose which one to activate for each command in a way that will make users happy?
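One way to see the bind is to sketch the agent registry itself. In the hypothetical example below, each agent simply declares the intents it claims to handle; the moment several of them claim the same intent, the assistant has nothing principled left to choose on.

```python
# Hypothetical agent registry: each entry is (agent name, self-declared capabilities).
REGISTRY = [
    ("Yelp",          {"find_restaurant"}),
    ("Foursquare",    {"find_restaurant", "find_bar"}),
    ("TripAdvisor",   {"find_restaurant", "find_hotel"}),
    ("Google Places", {"find_restaurant", "find_bar", "find_hotel"}),
]

def candidates(intent: str) -> list[str]:
    """Every agent that claims to handle the intent."""
    return [name for name, claims in REGISTRY if intent in claims]

print(candidates("find_restaurant"))
# ['Yelp', 'Foursquare', 'TripAdvisor', 'Google Places']
# All four claim the task. With only self-declared capabilities to go on,
# any tie-break (first registered? most capabilities? random?) is a guess.
```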
Working toward the solution
Early assistants, like Siri and the first versions of Alexa, worked around this problem by carefully curating the agents, keywords, and phrases they understood. Just as a magician carefully arranges a trick to make you think you saw something you didn't, thoughtful designers created the illusion that these assistants were capable of far more than they really were.
As people expect more and more from their assistants, there is growing pressure to open them up to outside developers, and that makes the multi-agent problem unavoidable.
Alexa and Cortana both solve the issue in part by forcing the user to decide which agent to use. ("Alexa, ask Dominos to send me a pizza.") Apple is taking a typically conservative and measured approach, allowing only a limited set of agents, mostly for reservation booking and car hailing.
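The explicit-invocation workaround amounts to making the user do the agent selection. The sketch below is a simplification for illustration; it is not Amazon's or Microsoft's actual invocation grammar.

```python
import re

# Simplified "ask <agent> to <request>" pattern, for illustration only.
INVOCATION = re.compile(r"ask (?P<agent>[\w\s]+?) to (?P<request>.+)", re.IGNORECASE)

def route(command: str):
    match = INVOCATION.search(command)
    if match:
        # The user has already named the agent, so no selection is needed.
        return match.group("agent").strip(), match.group("request").strip()
    return None, command  # no agent named: back to guessing

print(route("Alexa, ask Dominos to send me a pizza"))
# -> ('Dominos', 'send me a pizza')
```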
Most developers in the space are hoping that more sophisticated natural language processing or machine learning will provide the answer. Microsoft, Google, Apple, and Viv are all making major investments in these areas. Still others are trying to go further by giving the assistant more knowledge about the world. Ozlo, my own assistant, looks directly at the data inside the agents to try to improve its understanding.
It’s not clear what will work, or which solution will ultimately prove the winner. We can look back to the early days of web search for parallels, though.
Early search engines took a similar approach to today's assistants. Rather than peering directly into each web page, they relied on descriptions provided by the page authors, so-called metadata. For example, if you were building a website about dogs, you might put keywords like "dog," "canine," and "pets" in your metadata. Search engines would show results based solely on the words present in the metadata.
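A quick sketch, with made-up pages and keywords, shows how little metadata-only ranking had to go on:

```python
# Hypothetical pages and their author-supplied metadata keywords.
PAGES = {
    "dogs-r-us.example":  {"dog", "canine", "pets"},
    "cat-corner.example": {"cat", "feline", "pets"},
    "spam-site.example":  {"dog", "cat", "pets", "cheap", "deals"},  # claims everything
}

def rank(query: str) -> list[str]:
    """Rank pages purely by overlap between query terms and metadata keywords."""
    terms = set(query.lower().split())
    scored = sorted(PAGES, key=lambda site: len(terms & PAGES[site]), reverse=True)
    return [site for site in scored if terms & PAGES[site]]

print(rank("dog pets"))
# ['dogs-r-us.example', 'spam-site.example', 'cat-corner.example']
# The keyword-stuffed page scores exactly as well as the genuine dog site,
# because nothing is ever checked against the page content itself.
```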
It didn’t take long for a multitude of sites to claim “best in show” for all categories of information. As the web grew, less scrupulous website authors even filled their metadata with keywords that had nothing to do with their page, just to draw more traffic.
Eventually, Google solved this problem by taking the additional step of actually reading the contents of the web pages themselves, sometimes ignoring the metadata altogether. Only then did web search begin to approach the universal quality that we’ve grown to expect today.
While there are parallels, solving the multi-agent problem is not a simple replay of web search. The user requirements, technology, and even the data involved are radically different. But it does seem likely that until assistants begin to understand the tasks they are offering to users, it will be hard for them to meet the high hopes we all have for this category.
Only time will tell. The category is still young and evolving. But look carefully for the companies solving this problem head-on; they are the ones likely to dominate the next wave of intelligent assistants.