Thinking for Voice: Intentional Intents

In the voice design world, we often try to represent inputs from a user in terms of an Intent. Literally, we are trying to map from what the user says to what they intend.

Intents are a collection of actions or phrases that will all mean the same thing to the voice agent. This means that we can assign phrases such as "What is the weather?" and "How's the weather?" to the same Intent, and the system will respond the same way to both. Intents have some flexibility to them to accept parameters, so "What is the weather today?" and "What will the weather be tomorrow?" can use the same Intent, but have a different Parameter representing the day we're asking about.

In this example, there may be other Parameters for location as well, or a user may not specify the date or location, and our system should use defaults. In all these cases, however, we would be using the same Intent.
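
As a rough sketch of that idea, filling in the Parameters the user left out might look something like this. The parameter names and the default values are made up for illustration and don't come from any particular platform:

```javascript
// Hypothetical defaults; a real system might use the user's saved location.
const DEFAULTS = { date: "today", location: "home" };

// Fill in any Parameter the user did not say with its default.
function resolveWeatherParams( params ){
  const given = {};
  for( const [name, value] of Object.entries( params || {} ) ){
    // Treat undefined, null and empty strings as "not specified"
    if( value !== undefined && value !== null && value !== "" ){
      given[name] = value;
    }
  }
  return Object.assign( {}, DEFAULTS, given );
}

// "What will the weather be tomorrow?" only specifies the date
const resolved = resolveWeatherParams( { date: "tomorrow", location: "" } );
```

Either way, the same Intent is triggered; only the Parameter values change.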

As we design our conversations, we are typically on the lookout for phrases that we may express in different ways, but that should get the same result (or equivalent variants of the same result). These would be ideal candidates to boil down into one Intent. But this could be deceptive.

Identifying Intents

A couple of weeks ago, I attended a presentation by Rebecca Evanhoe that included a discussion about problems we might run into properly determining which phrases are truly equivalent.

To continue our example above, would we actually expect the same result if someone asked "What is the weather?" vs "How's the weather?" Perhaps, but perhaps not. Our system might reply with something like "Looks like it will rain today" and the forecasted high temperature. If there is other important information, such as a weather alert currently active, it might mention this as well. But what other things might people ask if they were asking about the weather? Perhaps questions like:

  • Will it rain today?
  • How much snow will we get?
  • What time will the rain start?
  • Should I bring in my plants tonight?
  • When will the tornado watch end?

Initially, we might think these are all "equivalent" questions - most want to know about the current weather. Yet some are more specific than others, and answering the general question will feel wrong. 

Rebecca brilliantly argues that this difference is the gap between what it feels like talking to a human and talking to a robot, and that failure to take this into account is what causes our systems to not feel either intelligent or conversational.

Humans can do this instinctively. Can we use something like machine learning to help our systems do the same? Perhaps. But such systems still require a huge amount of training to get the linguistic patterns correct, and once we're collecting those linguistic patterns, we have an easier solution. Sorta.

We can just build multiple Intents for phrases that are truly different, or that deserve different responses. With a good Natural Language Processing (NLP) system, we can even see what other phrases people use over time that we don't perfectly match, and attempt to either match them to existing Intents, or create new Intents for them.

This seemed like a lot of work to me, so during the Q&A, I asked Rebecca what I honestly thought wasn't a leading question: How can we make this easier?

What they say vs What you do

I was hoping for a silver bullet as an answer, but there isn't one. Her suggestion was to identify the most common phrases and determine the best and most useful responses for them. Where reasonable, combine the phrases into a single Intent, pick the most important others for their own Intent, and be prepared with a generic answer if necessary for any others.

This, of course, made sense in our design.

But as a developer, it was still frustrating. It meant writing a bunch of different Intent Handlers that all did similar, but not quite the same, things in order to produce responses that were similar, but not quite the same. Was there something that could be done to make this easier?

Thinking about it, I realized that I was ignoring my own advice. Intents represent what the user says, but don't dictate how we handle what they've said. Intents might dictate the form the response takes, but the underlying data that goes into that response is largely the same.

In our weather example, all of the questions ask us to get the weather for some default location for today, and then extract part of that information for the response. Not all the responses may make sense at a given moment - if we're asked about the time the rain will start, there may be no rain in the forecast, and we need to reply appropriately.

We can think of this as a "meta-Intent", or an overarching topic that we're being asked about. Dialogflow has this sort of concept built in using a field called "Action" (which unfortunately conflicts with the Assistant concept of Action), but as developers, we could group them together in our code if we wished.

Under this scheme, we can break it down this way:
  • An Intent still represents what the user says or does, making it as fine grained as we feel necessary, and possibly including parameters.
  • One or more Intents will map to a single Topic or Action, which indicates what processing we will do with this information. We'll get some data from this processing.
  • Based on both the output of the processing and the Intent that started this, we'll formulate a Response back to the user.
By splitting it into these three parts, we can more easily handle lots of different things the user might say, and tailor the response directly to how they're saying it and what they expect.
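
To make the three parts concrete, here is a minimal sketch in plain JavaScript. It doesn't use any particular library, the weather data is hard-coded, and the field names are just assumptions for the sake of illustration:

```javascript
// Part 1: several Intents map to a single Topic (or Action).
const intentToAction = {
  "weather.basic":  "weather",
  "weather.precip": "weather"
};

// Part 2: one handler per Action does the shared processing.
// The data is hard-coded here; a real handler would call a weather service.
const actionHandlers = {
  weather: () => ({ condition: "cloudy", tempF: 54, precip: "rain" })
};

// Part 3: one responder per Intent shapes the same data differently.
const responders = {
  "weather.basic":  data => `It is ${data.condition} and ${data.tempF} degrees.`,
  "weather.precip": data => data.precip
    ? `Yes, ${data.precip} is in the forecast.`
    : "No, nothing is in the forecast."
};

function respond( intentName ){
  const action = intentToAction[intentName];
  const data = actionHandlers[action]();
  return responders[intentName]( data );
}
```

Adding another way of asking about the weather then means adding one mapping and one responder, while the data loading stays untouched.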

Ok, but how?

Once I realized we could break it up this way, I realized that the multivocal library already supported this type of behavior. With multivocal, the different parts are used this way:
  • Using Dialogflow, we identify the phrases that go into an Intent. We also define the Action name that will be reported for this Intent. While each Intent name must be unique, we can give the same Action name to multiple Intents.
  • In our fulfillment webhook, we can set an Action Handler that executes a function for every request that comes in for the same Action name. 
    • This loads the data that is relevant to the Action based on the Intent's parameters or other state that we might have. 
    • If there is a problem, we may set an Outent.
  • Finally, multivocal picks a set of possible Responses based on the Outent, if that is set, or the Intent name if not. 
    • Based on other conditions we may have attached to some responses, we may further filter out which Responses are valid based on the data.
    • Each response is actually a template with markers indicating where the data we've loaded fit into the phrase being said back to the user.
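
The "Outent if set, otherwise the Intent name" rule can be sketched roughly like this. This is not multivocal's actual code: the env field names, the "weather.precip.time" Intent and the "Outent.weather.none" key are all hypothetical, picked to match the "no rain in the forecast" case from earlier:

```javascript
// Choose which list of Responses to draw from.
// `env.outent` and `env.intentName` are hypothetical field names.
function pickResponses( env, responses ){
  const key = env.outent ? env.outent : `Intent.${env.intentName}`;
  return responses[key] || [];
}

const responses = {
  "Intent.weather.precip.time": [
    "The rain should start around {{weather.forecast.start}}."
  ],
  "Outent.weather.none": [
    "There is no rain in the forecast."
  ]
};

// The handler found no rain, so it set an Outent to change the reply.
const noRain = pickResponses(
  { intentName: "weather.precip.time", outent: "Outent.weather.none" },
  responses
);
```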

Given our example above, we might build a bunch of Intents in Dialogflow and set them to the same Action name. Without showing all of them, the relevant parts of a few might look something like this:
These examples are missing many things we would see in fully fleshed out Intents, such as sample phrases that include the day we want them for or a location. It could even be argued that we would want separate Intents for rain vs snow, or that these should be parameters. However, these will do for now to illustrate how to handle them.
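
Since those settings live in the Dialogflow console rather than in code, here is a plain JavaScript stand-in for the same information. This is not Dialogflow's export format, just a sketch that pairs the Intent names used in the responses below with sample phrases from earlier in this post:

```javascript
// Each Intent has a unique name and its own phrases,
// but all of them report the same Action name.
const intents = [
  { name: "weather.basic",        action: "weather",
    phrases: [ "What is the weather?", "How's the weather?" ] },
  { name: "weather.precip",       action: "weather",
    phrases: [ "Will it rain today?" ] },
  { name: "weather.precip.quant", action: "weather",
    phrases: [ "How much snow will we get?" ] }
];

// Group Intent names by the Action they report.
function intentsByAction( list ){
  const grouped = {};
  for( const intent of list ){
    grouped[intent.action] = grouped[intent.action] || [];
    grouped[intent.action].push( intent.name );
  }
  return grouped;
}
```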

Each of these would require different responses, or different possible responses. Our configuration for those responses might look something like this:

{
  "Local": {
    "en-US": {
      "Response": {

        "Intent.weather.basic": [
          "It is currently {{weather.condition}} and {{weather.temp.f}} degrees.",
          "Currently it is {{weather.temp.f}} outside and {{weather.condition}}.",
          "Looks like it is {{weather.condition}}, with the thermometer saying {{weather.temp.f}}."
        ],

        "Intent.weather.precip": [
          {
            "Base": {"Set": true},
            "Criteria": "{{truthy weather.forecast.precip}}"
          },
          "Yup, {{weather.forecast.precip}} is in the forecast.",
          "Looks like we can expect {{weather.forecast.precip}}.",
          {
            "Base": {"Set": true},
            "Criteria": "{{falsey weather.forecast.precip}}"
          },
          "Nope, looks {{weather.condition}} at the moment.",
          "Probably not, should be {{weather.forecast.condition}}."
        ],

        "Intent.weather.precip.quant": [
          {
            "Criteria": "{{truthy weather.forecast.precip}}",
            "Template": {
              "Text": "Expect about {{weather.forecast.quant.in}} inches of {{weather.forecast.precip}}."
            }
          },
          {
            "Criteria": "{{and (eq weather.forecast.precip 'snow') (gt weather.forecast.quant.in 12)}}",
            "Template": {
              "Text": "Be careful! Could be over {{weather.forecast.quant.in}} inches of {{weather.forecast.precip}}."
            }
          },
          {
            "Criteria": "{{falsey weather.forecast.precip}}",
            "Template": {
              "Text": "We're not expecting any precipitation right now."
            }
          }
        ]

      }
    }
  }
}

The definition of responses for the "weather.basic" Intent is the most straightforward. We provide three possible responses, all of which may be used. In each, we use a weather object which we'll discuss momentarily, but which has fields that store the condition ("clear", "partly cloudy", "raining") and the temperature that we include in the result.
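
To show what "filling in" one of these templates means, here is a simplified stand-in written in plain JavaScript. Multivocal uses Handlebars for this, which does far more; this sketch only handles simple {{path.to.value}} markers, and the sample values are made up:

```javascript
// Replace each {{path.to.value}} marker with the matching value from env.
// Markers whose path is missing become an empty string.
function fillTemplate( template, env ){
  return template.replace( /\{\{\s*([\w.]+)\s*\}\}/g, ( match, path ) => {
    const value = path.split( "." ).reduce(
      ( obj, key ) => ( obj == null ? undefined : obj[key] ),
      env
    );
    return value == null ? "" : String( value );
  });
}

const sampleEnv = { weather: { condition: "partly cloudy", temp: { f: 68 } } };
const text = fillTemplate(
  "It is currently {{weather.condition}} and {{weather.temp.f}} degrees.",
  sampleEnv
);
// text is "It is currently partly cloudy and 68 degrees."
```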

The other two Intents are a little more complicated, but similar to each other. For the "weather.precip" Intent, we list a number of possible responses again, but this time we will set criteria for some of those responses by setting "Base" values, which apply to all later responses (until changed). So two responses are valid only if the forecasted precipitation field in the weather object is set to something such as "rain" or "snow", indicating that precipitation is in the forecast, and two are valid only in the opposite case. Those responses use either the forecasted precipitation type or, in one case, the current conditions (since there's nothing in the forecast).

The "weather.precip.quant" Intent is similar, but instead of setting base criteria, we include the criteria directly in the response itself. So when there is precipitation expected, we have an additional possible response if there are more than 12 inches of snow expected.
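
Conceptually, the filtering works something like the sketch below. This is a simplification and not how multivocal is implemented: criteria are plain functions instead of Handlebars expressions, the entry shapes are invented, and a response with its own criteria simply ignores the base:

```javascript
// Each entry is one of:
//   a plain string (valid if the current base criteria passes),
//   { base: true, criteria } (sets the criteria for later strings),
//   { criteria, text } (a response carrying its own criteria).
function validResponses( entries, env ){
  let baseCriteria = () => true;
  const valid = [];
  for( const entry of entries ){
    if( typeof entry === "string" ){
      if( baseCriteria( env ) ) valid.push( entry );
    } else if( entry.base ){
      baseCriteria = entry.criteria;
    } else if( entry.criteria( env ) ){
      valid.push( entry.text );
    }
  }
  return valid;
}

const hasPrecip = env => Boolean( env.weather.forecast.precip );
const precipEntries = [
  { base: true, criteria: hasPrecip },
  "Yup, it is in the forecast.",
  { base: true, criteria: env => !hasPrecip( env ) },
  "Nope, nothing in the forecast."
];
```

One of the valid responses would then be picked and have its template filled in.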

But how is the weather object itself populated? Through an Action handler function that we write and then register when we set up our webhook:

function handleWeather( env ){
  const location = env.location;  // Assume we build this somewhere
  const date     = env.date;      // Assume we build the date too
  return getWeather( location, date )
    .then( data => {
      const weather = {
        // Build the weather object from the data we get back
      };
      env.weather = weather;
      return Promise.resolve( env );
    });
}

Multivocal.addActionHandler( "weather", handleWeather );

I've omitted lots here, since I don't want to go into details about calling an API, but I'm going to assume we have a few things:
  • We get the location and date somehow. This could be by the user saying so, us asking for the location, assuming defaults, or something else.
  • With this location and date, we can call getWeather(), which uses an API of some sort to get the data. Since this is probably asynchronous, we'll return a Promise that resolves to the data with all the weather information.
  • From this data, we have everything we need to build our weather object that we use with the responses.
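
As an illustration of that last step, building the weather object might look something like this. The shape of data is completely made up, since it depends on whatever weather API we call; only the fields on the resulting object are taken from the response templates above:

```javascript
// `data` is a hypothetical API result; a real weather service will differ.
function buildWeather( data ){
  return {
    condition: data.current.summary,           // "clear", "raining", ...
    temp: { f: data.current.tempF },
    forecast: {
      condition: data.daily.summary,
      precip: data.daily.precipType || "",     // "rain", "snow", or ""
      quant: { in: data.daily.precipInches || 0 }
    }
  };
}

const weather = buildWeather({
  current: { summary: "clear", tempF: 30 },
  daily:   { summary: "snow showers", precipType: "snow", precipInches: 14 }
});
```

Leaving precip as an empty string when nothing is expected is what lets the truthy and falsey criteria tell the two cases apart.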


Separating our business logic from our responses, and having our responses be more closely tied to the Intents that the user triggers, seem like a good solution to the problem of having to make many Intents that vary only slightly from each other.

Different platforms may handle this differently. It is worth noting that Samsung's Bixby platform comes with tools that work similarly to what multivocal offers, while Amazon's ASK would require you to build something that bundles the Intents into Topics yourself. But conceptually, the idea is sound and useful no matter which platforms you design and develop against.

My thanks to Rebecca Evanhoe for her presentation that sent me down this thought path.


Thinking for Voice: Context is Queen

When we think about designing and building for voice, we often say that we need to pay attention to the context at each stage of the conversation. There are many aspects to what "context" means, however, and it isn't always clear whether something is (or needs to be) contextual, or how contextual it should be.

Context for the comments

I ran into this sort of issue while working on an Action this week. Users can work on different kinds of files, and the answer to "what can I do" depends on which type of file they're using. Making this help contextual was obvious and straightforward - just provide commands that make sense for this file.

But is that all of the context that we need to be aware of? Possibly not. Some of the users may not have linked their account, so they only have limited access to some files. Other users may not be permitted to write to the file, just to read from it. Some features may be premium features.

If we give help, do we tell users about everything that is possible? Or just what they have access to?

In a visual world, we could display a large generic help page, covering all possible options available at that time, indicating which ones are restricted or require additional permissions. We might even use this to upsell the premium version while comparing the features available between the various versions.

We don't have that kind of luxury with voice, however.

With voice, we need to deliver our message succinctly, providing what the user needs at that moment, but no more than they want, while also making them aware of what else they can ask for if they wish. It is a delicate balance - too much information can be overwhelming, and they won't really take all of it in. Too little, and they will get frustrated that they keep having to ask for more information.

While discussing my dilemma with others, they helped me realize that the message I had been using was giving people the impression that they could do more than they could at that moment. I had been working with concepts borrowed from "greyed out" menu items, and thought that I was helping them understand what they could do in the Action.

The difference is important, and making sure we understand what the user expects when they ask the question is important as well.

(One question raised is how often users actually do something like this. It is a good question, and one I'll have to delve into another time.)

Crafting Context

So how did I fix my problem? Using the multivocal library, responses are just templates and can be keyed to an Intent, so it was easy to use a different template in each scenario. Multivocal's response configuration can also specify under what criteria the response is valid by evaluating environment settings using Handlebars.

Since I was already setting an isDemo environment setting in the existing User setting, I leveraged this. I made one response valid if the isDemo environment setting was true, and the other if isDemo was false. There are also references to configurations for suggestion chips and a card that links to the website - both of these are used in other responses, so they're just included here.

It looks something like this:

{
  Local: {
    und: {
      Response: {
        "Intent.filetype.help": [
          {Base: {Ref: "Config/Local/und/linkCard"}},
          {Base: {Ref: "Config/Local/und/suggestions"}},
          {
            Criteria: "{{not User.isDemo}}",
            Template: {
              Text: enHelpText
            }
          },
          {
            Criteria: "{{User.isDemo}}",
            Template: {
              Text: enHelpDemoText
            }
          }
        ]
      }
    }
  }
}

The enHelpText and enHelpDemoText are just JavaScript constants that contain the text for each scenario.

const enHelpText = `
  You currently have a file open.
  You can say things like:
   "Tell me about this file",
   "Add a record",
   "Set a value", or
   "Get a value".
  Visit example.com for more examples and help.
`;

const enHelpDemoText = `
  You currently have a sample file open,
  so your commands are more limited.
  You can say things like
   "Tell me about this file", or
   "Get a value".
  Visit example.com for details and further help.
`;
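
Two fixed texts were enough for my case, but the same idea could stretch to the other kinds of context mentioned earlier, such as read-only access. As a sketch of one way to do that (this is not what I built, and the needsWrite and canWrite names are hypothetical), the list of commands could be generated from what the user is allowed to do:

```javascript
// `needsWrite` marks the commands that change the file.
const COMMANDS = [
  { phrase: "Tell me about this file", needsWrite: false },
  { phrase: "Add a record",            needsWrite: true  },
  { phrase: "Set a value",             needsWrite: true  },
  { phrase: "Get a value",             needsWrite: false }
];

// Build the spoken list of commands this user can use right now.
function helpPhrases( user ){
  const canWrite = !user.isDemo && user.canWrite !== false;
  const phrases = COMMANDS
    .filter( command => canWrite || !command.needsWrite )
    .map( command => `"${command.phrase}"` );
  if( phrases.length < 2 ) return phrases.join( "" );
  return phrases.slice( 0, -1 ).join( ", " ) + ", or " + phrases[phrases.length - 1];
}

const demoHelp = helpPhrases( { isDemo: true } );
// demoHelp is: "Tell me about this file", or "Get a value"
```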

In Conclusion

When thinking for voice, we have to keep context in mind for everything we say to people talking to our assistants. Not just the context of what their question was, but how skilled they are, what permissions they have, and dozens of other contextual factors that we may have. Above all, we need to keep this context in mind to make sure our replies guide them towards the information they want, understanding what state they're in, what else they can do, and how they may be able to learn more if necessary.

This post was based on a tweet I sent out, pondering the issue. My thanks to Cathy Pearl, Rebecca E, Jeremy Wilken, and Siddharth Shukla for their input and discussion on the issue. Do you have thoughts on this? I'd love to hear them on the tweet above!