Thinking for Voice: Intentional Intents

In the voice design world, we often try to represent inputs from a user in terms of an Intent. Literally, we are trying to map from what the user says to what they intend.

Intents are a collection of actions or phrases that will all mean the same thing to the voice agent. This means that we can assign phrases such as "What is the weather?" and "How's the weather?" to the same Intent, and the system will respond the same way to both. Intents have some flexibility to them to accept parameters, so "What is the weather today?" and "What will the weather be tomorrow?" can use the same Intent, but have a different Parameter representing the day we're asking about.

In this example, there may be other Parameters for location as well, or a user may not specify the date or location, and our system should use defaults. In all these cases, however, we would be using the same Intent.
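
As a rough sketch of that idea, here is how two phrasings could resolve to the same Intent while differing only in their Parameters. The names withDefaults and DEFAULTS, and the shape of the matched object, are assumptions made for illustration rather than anything a particular platform provides.

```javascript
// Hypothetical defaults used when the user doesn't say a date or location.
const DEFAULTS = { date: "today", location: "home" };

// Merge whatever Parameters the NLP system extracted with our defaults.
// `matched` is an assumed shape: { intent: string, parameters: object }.
function withDefaults( matched ){
  return {
    intent: matched.intent,
    parameters: { ...DEFAULTS, ...matched.parameters }
  };
}

// "What is the weather?" -> no Parameters given
const a = withDefaults( { intent: "weather", parameters: {} } );
// "What will the weather be tomorrow?" -> date Parameter given
const b = withDefaults( { intent: "weather", parameters: { date: "tomorrow" } } );
```

Both results carry the same Intent name; only the date Parameter differs, and the location falls back to the default in each case.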

As we design our conversations, we are typically on the lookout for phrases that we may express in different ways, but that should get the same result (or equivalent variants of the same result). These would be ideal candidates to boil down into one Intent. But this can be deceptive.

Identifying Intents

A couple of weeks ago, I attended a presentation by Rebecca Evanhoe that included a discussion about problems we might run into properly determining which phrases are truly equivalent.

To continue our example above, would we actually expect the same result if someone asked "What is the weather?" vs "How's the weather?" Perhaps, but perhaps not. Our system might reply with something like "Looks like it will rain today" and the forecasted high temperature. If there is other important information, such as a weather alert currently active, it might mention this as well. But what other things might people ask if they were asking about the weather? Perhaps questions like:

  • Will it rain today?
  • How much snow will we get?
  • What time will the rain start?
  • Should I bring in my plants tonight?
  • When will the tornado watch end?

Initially, we might think these are all "equivalent" questions - most want to know about the current weather. Yet some are more specific than others, and answering the general question will feel wrong. 

Rebecca brilliantly argues that this difference is the gap between what it feels like talking to a human and talking to a robot, and that failure to take this into account is what causes our systems to not feel either intelligent or conversational.

Humans can do this instinctively. Can we use something like machine learning to help our systems do the same? Perhaps. But such systems still require a huge amount of training to get the linguistic patterns correct, and once we're collecting those linguistic patterns, we have an easier solution. Sorta.

We can just build multiple Intents for phrases that are truly different, or that deserve different responses. With a good Natural Language Processing (NLP) system, we can even see what other phrases people use over time that we don't perfectly match, and attempt to either match them to existing Intents, or create new Intents for them.

This seemed like a lot of work to me, so during the Q&A, I asked Rebecca what I honestly thought wasn't a leading question: How can we make this easier?

What they say vs What you do

I was hoping for a silver bullet as an answer, but there isn't one. Her suggestion was to identify the most common phrases and determine the best and most useful responses for them. Where reasonable, combine the phrases into a single Intent, pick the most important others for their own Intent, and be prepared with a generic answer if necessary for any others.

This, of course, made sense in our design.

But as a developer, it was still frustrating. It meant writing a bunch of different Intent Handlers that all did similar, but not quite the same, things in order to produce responses that were similar, but not quite the same. Was there something that could be done to make this easier?

Thinking about it, I realized that I was ignoring my own advice. Intents represent what the user says, but don't dictate how we handle what they've said. Intents might dictate the form the response takes, but the underlying data that goes into that response is largely the same.

In our weather example, all of the questions ask us to get the weather for some default location for today, and then extract part of that information for the response. Not all the responses may make sense at a given moment - if we're asked about the time the rain will start, there may be no rain in the forecast, and we need to reply appropriately.

We can think of this as a "meta-Intent", or an overarching topic that we're being asked about. Dialogflow has this sort of concept built in using a field called "Action" (which unfortunately conflicts with the Assistant concept of Action), but as developers, we could group them together in our code if we wished.

Under this scheme, we can break it down this way:
  • An Intent still represents what the user says or does, making it as fine grained as we feel necessary, and possibly including parameters.
  • One or more Intents will map to a single Topic or Action, which indicates what processing we will do with this information. We'll get some data from this processing.
  • Based on both the output of the processing and the Intent that started this, we'll formulate a Response back to the user.
By splitting it into these three parts, we can more easily handle lots of different things the user might say, and tailor the response directly to how they're saying it and what they expect.
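
To make the three parts concrete, here is a minimal sketch in plain JavaScript with no library involved. Every name in it (intentToAction, actionHandlers, responders, respond) is made up for illustration, and the weather data is a hard-coded stand-in.

```javascript
// Several fine-grained Intents share one Action (all names are illustrative).
const intentToAction = {
  "weather.basic":        "weather",
  "weather.precip":       "weather",
  "weather.precip.quant": "weather"
};

// One handler per Action does the shared processing and returns data.
const actionHandlers = {
  weather: ( parameters ) => ({
    // Stand-in data; a real handler would call a weather service.
    condition: "cloudy", tempF: 48, precip: "rain", inches: 0.5
  })
};

// One formatter per Intent shapes that data into a Response.
const responders = {
  "weather.basic": d =>
    `It is currently ${d.condition} and ${d.tempF} degrees.`,
  "weather.precip": d =>
    d.precip ? `Yup, ${d.precip} is in the forecast.` : "Nope, nothing in the forecast.",
  "weather.precip.quant": d =>
    d.precip ? `Expect about ${d.inches} inches of ${d.precip}.`
             : "We're not expecting any precipitation right now."
};

// Intent -> Action -> data -> Response
function respond( intent, parameters = {} ){
  const action = intentToAction[intent];
  const data   = actionHandlers[action]( parameters );
  return responders[intent]( data );
}
```

Calling respond( "weather.basic" ) and respond( "weather.precip.quant" ) runs the same handler once each, but produces differently worded replies from the same data.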

Ok, but how?

Once I saw that we could break it up this way, I realized that the multivocal library already supported this type of behavior. With multivocal, the different parts are used this way:
  • Using Dialogflow, we identify the phrases that go into an Intent. We also define the Action name that will be reported for this Intent. While each Intent name must be unique, we can give the same Action name to multiple Intents.
  • In our fulfillment webhook, we can set an Action Handler that executes a function for every request that comes in for the same Action name. 
    • This loads the data that is relevant to the Action based on the Intent's parameters or other state that we might have. 
    • If there is a problem, we may set an Outent.
  • Finally, multivocal picks a set of possible Responses based on the Outent, if that is set, or the Intent name if not. 
    • Based on other conditions we may have attached to some responses, we may further filter out which Responses are valid based on the data.
    • Each response is actually a template with markers indicating where the data we've loaded fit into the phrase being said back to the user.

Given our example above, we might build a bunch of Intents in Dialogflow and set them to the same Action name. Without showing all of them, the relevant parts of a few might look something like this:
These examples are missing many things we would see in fully fleshed out Intents, such as sample phrases that include the day we want them for or a location. It could even be argued that we would want separate Intents for rain vs snow, or that these should be parameters. However, these will do for now to illustrate how to handle them.
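
For reference while reading the rest of this post, here is one way to summarize such a set of Intents as plain data. This is only an illustrative sketch and not Dialogflow's export format: the Intent names match the ones used in the responses in this post, but which sample phrase belongs to which Intent is my own assumption.

```javascript
// Illustrative summary only: this is NOT Dialogflow's export format.
const intents = [
  { name: "weather.basic",
    action: "weather",
    phrases: [ "What is the weather?", "How's the weather?" ] },
  { name: "weather.precip",
    action: "weather",
    phrases: [ "Will it rain today?" ] },
  { name: "weather.precip.quant",
    action: "weather",
    phrases: [ "How much snow will we get?" ] }
];

// Every Intent name is unique, but they all report the same Action name.
const names   = new Set( intents.map( i => i.name ) );
const actions = new Set( intents.map( i => i.action ) );
```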

Each of these would require different responses, or different possible responses. Our configuration for those responses might look something like this:

  "Local": {
    "en-US": {
      "Response": {

        "Intent.weather.basic": [
          "It is currently {{weather.condition}} and {{weather.temp.f}} degrees.",
          "Currently it is {{weather.temp.f}} outside and {{weather.condition}}.",
          "Looks like it is {{weather.condition}}, with the thermometer saying {{weather.temp.f}}."
        ],

        "Intent.weather.precip": [
          {
            "Base": {"Set": true},
            "Criteria": "{{truthy weather.forecast.precip}}"
          },
          "Yup, {{weather.forecast.precip}} is in the forecast.",
          "Looks like we can expect {{weather.forecast.precip}}.",
          {
            "Base": {"Set": true},
            "Criteria": "{{falsey weather.forecast.precip}}"
          },
          "Nope, looks {{weather.condition}} at the moment.",
          "Probably not, should be {{weather.forecast.condition}}."
        ],

        "Intent.weather.precip.quant": [
          {
            "Criteria": "{{truthy weather.forecast.precip}}",
            "Template": {
              "Text": "Expect about {{weather.forecast.quant.in}} inches of {{weather.forecast.precip}}."
            }
          },
          {
            "Criteria": "{{and (eq weather.forecast.precip 'snow') (gt weather.forecast.quant.in 12)}}",
            "Template": {
              "Text": "Be careful! Could be over {{weather.forecast.quant.in}} inches of {{weather.forecast.precip}}."
            }
          },
          {
            "Criteria": "{{falsey weather.forecast.precip}}",
            "Template": {
              "Text": "We're not expecting any precipitation right now."
            }
          }
        ]

      }
    }
  }

The definition of responses for the "weather.basic" Intent is the most straightforward. We provide three possible responses, all of which may be used. In each, we use a weather object which we'll discuss momentarily, but which has fields that store the condition ("clear", "partly cloudy", "raining") and the temperature that we include in the result.

The other two Intents are a little more complicated, but similar to each other. For the "weather.precip" Intent, we again list a number of possible responses, but this time we set criteria for some of those responses by setting "Base" values, which will be applied to all later responses (until changed). So we will have two responses whose criteria are that the forecasted precipitation field in the weather object is set to something such as "rain" or "snow", indicating that precipitation is in the forecast, and two whose criteria are the opposite. Those responses use either the forecasted precipitation type or, in one case, the current conditions (since there's nothing in the forecast).

The "weather.precip.quant" Intent is similar, but instead of setting base criteria, we include the criteria directly in the response itself. So when there is precipitation expected, we have an additional possible response if there are more than 12 inches of snow expected.
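
To illustrate the idea of filtering responses by criteria and then filling in a template, here is a simplified sketch. It is not how multivocal implements this: the criteria below are ordinary JavaScript functions rather than template expressions, and lookup, fill and validResponses are names I've invented for the example.

```javascript
// Look up a dotted path such as "weather.forecast.precip" in an object.
function lookup( obj, path ){
  return path.split( "." ).reduce(
    (o, key) => (o == null ? undefined : o[key]), obj );
}

// Replace {{dotted.path}} markers with values from the data.
function fill( template, data ){
  return template.replace( /\{\{\s*([\w.]+)\s*\}\}/g,
    (_, path) => String( lookup( data, path ) ) );
}

// Keep only the candidates whose criteria (if any) pass, then fill each one.
function validResponses( candidates, data ){
  return candidates
    .filter( c => !c.criteria || c.criteria( data ) )
    .map( c => fill( c.text, data ) );
}

const candidates = [
  { criteria: d => Boolean( d.weather.forecast.precip ),
    text: "Expect about {{weather.forecast.quant.in}} inches of {{weather.forecast.precip}}." },
  { criteria: d => d.weather.forecast.precip === "snow" && d.weather.forecast.quant.in > 12,
    text: "Be careful! Could be over {{weather.forecast.quant.in}} inches of {{weather.forecast.precip}}." },
  { criteria: d => !d.weather.forecast.precip,
    text: "We're not expecting any precipitation right now." }
];

// Two sample data sets: heavy snow expected, and nothing expected.
const snowy = { weather: { forecast: { precip: "snow", quant: { in: 14 } } } };
const dry   = { weather: { forecast: { precip: null,   quant: { in: 0 } } } };
```

With the snowy data, the first two candidates are both valid and either could be said; with the dry data, only the last one survives the filter.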

But how is the weather object itself populated? Through an Action handler function that we write and then register when we set up our webhook:

// getWeather() is assumed to be defined elsewhere and to return a Promise.
function handleWeather( env ){
  const location = env.location;  // Assume we build this somewhere
  const date     = env.date;      // Assume we build the date too
  return getWeather( location, date )
    .then( data => {
      const weather = {
        // Build the weather object from the data we get back
      };
      // Store it where the response templates can find it
      env.weather = weather;
      return Promise.resolve( env );
    });
}

Multivocal.addActionHandler( "weather", handleWeather );

I've omitted lots here, since I don't want to go into details about calling an API, but I'm going to assume we have a few things:
  • We get the location and date somehow. This could be by the user saying so, us asking for the location, assuming defaults, or something else.
  • With this location and date, we can call getWeather(), which uses an API of some sort to get the data. Since this is probably asynchronous, we'll return a Promise that resolves to the data with all the weather information.
  • From this data, we have everything we need to build our weather object that we use with the responses.
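
As a sketch of that last step, here is what building the weather object might look like. The output fields match the ones used in the response templates above, but the shape of the incoming data (currentCondition, currentTempF, forecast.precipType and so on) is entirely invented, since it depends on whichever weather API you call.

```javascript
// `data` is an invented shape standing in for whatever a real weather API returns.
function buildWeather( data ){
  const forecast = data.forecast || {};
  return {
    condition: data.currentCondition,
    temp: { f: Math.round( data.currentTempF ) },
    forecast: {
      condition: forecast.condition,
      // Leave precip unset when nothing is expected, so that
      // "truthy" / "falsey" criteria on this field behave as intended.
      precip: forecast.precipType || undefined,
      quant: { in: forecast.precipInches || 0 }
    }
  };
}

const sample = buildWeather({
  currentCondition: "partly cloudy",
  currentTempF: 47.6,
  forecast: { condition: "rainy", precipType: "rain", precipInches: 0.4 }
});
```

Keeping this mapping in one function means the response templates only ever depend on our own weather object, not on the API's field names.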


Separating our business logic from our responses, and having our responses be more closely tied to the Intents that the user triggers, seem like a good solution to the problem of having to make many Intents that vary only slightly from each other.

Different platforms may handle this differently. It is worth noting that Samsung's Bixby platform comes with tools that work similarly to what multivocal offers, while Amazon's ASK would require you to build something yourself that bundles the Intents into Topics. But conceptually, the idea is sound and useful no matter which platforms you design and develop for.

My thanks to Rebecca Evanhoe for her presentation that sent me down this thought path.
