Thinking for Voice: Semantics Squash Syntactic Sugar

When we think about voice, it is natural to start from how we've done the same thing in a non-vocal world. We might look at how we've done something visually, or what we did when we just had a keyboard or even a pen. Particularly for tasks that only became common with the introduction of computers, we might not think about what the pre-computer method was, or how someone who had to use analog processes might have done it.

How do we do this for tasks we have spent years building visual tools around?

We need to think about the semantics of what we're talking about - what something means, and what it is made of. Not just the syntax - a set of rules about how to arrange something. Finding the semantics helps us understand and describe the bigger picture.

As such, sometimes it requires us to step back and think. Here are a few examples.

Smart Home control

Perhaps the most straightforward way to illustrate this is by looking at how we use our smart home controls now.  Saying things like "turn on the lights", "turn off the living room tv", or "is the sprinkler running" are very direct, syntactic, commands that we would expect.

But users are also expecting that saying something like "good night" would turn off the lights and make sure the house alarm was enabled. Saying "I'm home" might turn on the lights in the entry way. Other, similar, commands don't specifically say what the lights should do - but do provide the context that we expect "something" should happen.

Systems that only understand us syntactically can only respond to direct commands, but would become frustrating if we had to issue several commands if we wanted to turn on or adjust several lights. Systems that also understand the semantics of our statements, for example what we might expect when we say "good night", are at the heart of where voice needs to go to become truly powerful, and not just for appliance control.


Many games seem like they wouldn't work well with voice, at least given current levels of technology, but there are quite a few that seem to have found a good place in the voice world. Games that emulate TV game shows or bar games, for example, seem well suited for this because we see how voice uses them today. Text adventure games and escape room games, that can be played at the user's pace, also work well.

For others, it may be more of a challenge to think about. It is difficult, for example, to imagine controlling a character that is trying to move in real-time when our voice processing takes a second or two to handle. Many of our action games expect hair-trigger responses, and our voices aren't up to the task. It can seem quite absurd to think about playing a classic video game by shouting "left! left!" to try and move characters.

But all of these are syntactically driven - issuing specific commands. What if we shifted to more giving instructions, perhaps even almost guidelines, stating what our intent or objective is, rather thane exactly how to achieve that objective. This becomes a semantically driven game.

How might that work? It depends on the game.

What would a solitaire card game look and sound like if voice driven? A raw translation of a current game might have a conversation like this:
Game: "What row to select card from?"
Human: "Third row"
Game: "You can only move the bottom card. Where do you want to move it to?"
Human: "Fourth row"
Game: "Ok. You can pick a card or deal more."
and so forth. This sounds very stilted, but is strictly command driven. What if it was more semantically driven?
Game: "You have a few moves, or you can deal more cards. What do you want to do?"
Human: "Move the five to the six."
Game: "There you go. What next?"
Human: "Deal"

Once we start thinking about games in terms of semantics, the entire notion of some of how we play games change. Currently war games, for example, require "Generals" to make minute decisions about troops and issue those commands. What if the game starts you as a low-level officer, who needs to issue fairly tactical commands, and you get promoted higher and higher, having to deal with issues on a more strategic level? This can be done today, but it lends itself better to voice control because this is more like the real-world the game seeks to emulate.


This notion of "how we do it now in the physical world" continues when we look at voice and tools we use in business. When I tell people about Vodo Drive, and how it lets you work with a spreadsheet with your voice, I typically get a funny look. "That doesn't seem very easy to use," is a typical response, "Isn't it odd and difficult to say things like 'go up' and 'go to cell C3'?"

I would agree, it is weird. And probably a lousy Visual User Interface (VUI). It is how our modern spreadsheets are currently built, but let's take a step back and think about what spreadsheets actually represent and how they're actually used today. They're a way to organize data in a structured form, with several different forms being the most common. It is pretty common to look at a spreadsheet, for example, and see column headers along the top, each row representing some chunk of data, with the most identifying part of that data to the left, and details about it further to the right.

This structure provides the semantics of the spreadsheet. Rows become individual records - we tend to work on things a row at a time. Columns are fields in that record that we can identify by name. Sometimes we may want to look something up by one value, and find out other values in that row. As a spreadsheet, we may enter some values, and compute other values in a row.

Once we begin to understand the semantics, we no longer need to issue commands about how to move around the spreadsheet - that would be thinking of the spreadsheet syntactically. Instead, we can carry on a conversation surrounding the semantics. We can ask it to add a record and be prompted for what values to fill in. We can locate some information. We can ask "what-if" questions.

If we think about how a spreadsheet is used, we realize that this is exactly how it is used. Someone may ask a question, either to themselves or to someone else, and in order to find out the answer, we may organize the data in a spreadsheet and do some calculations through the sheet to figure out the result. The question is phrased semantically, and the answer given the same way.

Computer Programming

Does this idea of "how it works in the real world" hold up elsewhere? For example, would it work if we were trying to use voice to write a computer program?

Maybe. Programs aren't really "real world" constructions, and they don't have the same constraints that we might find in games or spreadsheets. They are much more free-form. (If they weren't, we wouldn't have so many different kinds.)

Which is why it makes sense that the initial rounds of programming with voice focus on syntactic accessibility systems to allow programmers to continue working, largely the way they have before. Tools like Talon provide ways to create "voice shortcuts" to move around a code base, add code (including whatever flavor of syntactic sugar the language requires), and perform other things using the language and framework they're used to.

This is important, to be certain. But can we do better? Can we look at how we think about and write software now and possibly turn that into a set of voice-directed tasks?

Again, maybe.

This is certainly more challenging than providing a way to update a spreadsheet or control our lights. We need to take a big step back and think about... well... think about how we think when we write programs.

One common way to do so is to start with a big task, and to break it into logical, smaller, parts. Let's look at a very simple example

  • Ask for a number, which we will call N
  • Print something based on N
we might break that second one down into something like
  • If N is odd, the character (C) to print will be a star, otherwise C will be a plus sign
  • Print the character N times

This is simple, but already talking about it seems somewhat complex. And yet, it illustrates how we can narrate a few basic components of programming - input and output, breaking tasks down into smaller functions, conditionals, and loops. If we said it this way, and with guidance, could we actually turn that into a real, functional program? Would it scale to a large program with hundreds of thousands of lines?

Can we leverage some existing programming languages to help us do this? Perhaps building off a more visual programming language, rather than a text-based one, would be the better approach - controlling blocks in something like Scratch instead of trying to control lines in more traditional languages. Or do we need to start from a tablula rasa?

Honestly - I don't know. (I didn't know if it could be done with spreadsheets, but that is progressing nicely.) I do know that I'm thinking about all this, and starting to work on it, and I'd like to think, talk, and work with others who see this as a reasonable problem to tackle.


Voice is a powerful new tool that we have - but most of our work with it to date has been focused on the syntactical approach. But to truly leverage the power it will bring us, to truly make for a conversational experience, we need to re-think our approach. We need to understand the semantics of the questions and commands coming from our users.

To date, people have been telling their virtual assistants how to do something. Voice becomes incredible, however, when people are able to express what they want, instead.


Thinking for Voice: Intentional Intents

In the voice design world, we often try to represent inputs from a user in terms of an Intent. Literally, we are trying to map from what the user says to what they intend.

Intents are a collection of actions or phrases that will all mean the same thing to the voice agent. This means that we can assign phrases such as "What is the weather?" and "How's the weather?" to the same Intent, and the system will respond the same way to both. Intents have some flexibility to them to accept parameters, so "What is the weather today?" and "What will the weather be tomorrow?" can use the same Intent, but have a different Parameter representing the day we're asking about.

In this example, there may be other Parameters for location as well, or a user may not specify the date or location, and our system should use defaults. In all these cases, however, we would be using the same Intent.

As we design our conversations, we are typically on the lookout for phrases that we may express different ways, but that should get the same result (or equivalent variants of the same result). These would be ideal candidates to boil into one Intent. But this could be deceptive.

Identifying Intents

A couple of weeks ago, I attended a presentation by Rebecca Evanhoe that included a discussion about problems we might run into properly determining which phrases are truly equivalent.

To continue our example above, would we actually expect the same result if someone asked "What is the weather?" vs "How's the weather?" Perhaps, but perhaps not. Our system might reply with something like "Looks like it will rain today" and the forecasted high temperature. If there is other important information, such as a weather alert currently active, it might mention this as well. But what other things might people ask if they were asking about the weather? Perhaps questions like

  • Will it rain today?
  • How much snow will we get?
  • What time will the rain start?
  • Should I bring in my plants tonight?
  • When will the tornado watch end?

Initially, we might think these are all "equivalent" questions - most want to know about the current weather. Yet some are more specific than others, and answering the general question will feel wrong. 

Rebecca brilliantly argues that this difference is the gap between what it feels like talking to a human and talking to a robot, and that failure to take this into account is what causes our systems to not feel either intelligent or conversational.

Humans can do this instinctively. Can we use something like machine learning to help our systems do the same? Perhaps. But they still require a huge amount of training to get the linguistics patterns correct, and once we're collecting those linguistics patterns, we have an easier solution. Sorta.

We can just build multiple Intents for phrases that are truly different, or that deserve different responses. With a good Natural Language Processing (NLP) system, we can even see what other phrases people use over time that we don't perfectly match, and attempt to either match them to existing Intents, or create new Intents for them.

This seemed like a lot of work to me, so during the Q&A, I asked Rebecca what I honestly thought wasn't a leading question: How can we make this easier?

What they say vs What you do

I was hoping for a silver bullet as an answer, but there isn't one. Her suggestion was to identify the most common phrases and determine the best and most useful responses for them. Where reasonable, combine the phrases into a single Intent, pick the most important others for their own Intent, and be prepared with a generic answer if necessary for any others.

This, of course, made sense in our design.

But as a developer, it was still frustrating. It meant writing a bunch of different Intent Handlers that all did similar, but not quite the same, things in order to produce responses that were similar, but not quite the same. Was there something that could be done to make this easier?

Thinking about it, I realized that I was ignoring my own advice. Intents represent what the user says, but doesn't dictate how we handle what they've said. Intents might dictate the form the response might take, but the underlying data that goes into that response is largely the same thing.

In our weather example, all of the questions ask us to get the weather for some default location for today, and then extract part of that information for the response. Not all the responses may make sense at a given moment - if we're asked about the time the rain will start, there may be no rain in the forecast, and we need to reply appropriately.

We can think of this as a "meta-Intent", or an overarching topic that we're being asked about. Dialogflow has this sort of concept built in using a field called "Action" (which unfortunately conflicts with the Assistant concept of Action), but as developers, we could group them together in our code if we wished.

Under this scheme, we can break it down this way:
  • An Intent still represents what the user says or does, making it as fine grained as we feel necessary, and possibly including parameters.
  • One or more Intents will map to a single Topic or Action, which indicates what processing we will do with this information. We'll get some data from this processing.
  • Based on both the output of the processing and the Intent that started this, we'll formulate a Response back to the user.
By splitting it into these three parts, we can more easily handle lots of different things the user might say, and tailor the response directly to how they're saying it and what they expect.

Ok, but how?

Once I realized we could break it up this way, I realized that the multivocal library already supported this type of behavior. With multivocal, the different parts are used this way:
  • Using Dialogflow, we identify the phrases that go into an Intent. We also define the Action name that will be reported for this Intent. While each Intent name must be unique, we can give the same Action name to multiple Intents.
  • In our fulfillment webhook, we can set an Action Handler that executes a function for every request that comes in for the same Action name. 
    • This loads the data that is relevant to the Action based on the Intent's parameters or other state that we might have. 
    • If there is a problem, we may set an Outent.
  • Finally, multivocal picks a set of possible Responses based on the Outent, if that is set, or the Intent name if not. 
    • Based on other conditions we may have attached to some responses, we may further filter out which Responses are valid based on the data.
    • Each response is actually a template with markers indicating where the data we've loaded fit into the phrase being said back to the user.

Given our example above, we might build a bunch of Intents in Dialogflow and set them to the same Action name. Without showing all of them, the relevant parts of a few might look something like this:
These examples are missing many things we would see in fully fleshed out Intents, such as sample phrases that include the day we want them for or a location. It could even be argued that we would want separate Intents for rain vs snow, or that these should be parameters. However, these will do for now to illustrate how to handle them.

Each of these would require different responses, or different possible responses. Our configuration for those responses might look something like this:

  "Local": {

    "en-US": {
      "Response": {

        "Intent.weather.basic": [
          "It is currently {{weather.condition}} and {{weather.temp.f}} degrees.",
          "Currently it is {{weather.temp.f}} outside and {{weather.condition}}.",
          "Looks like it is {{weather.condition}}, with the thermometer saying {{weather.temp.f}}."

        "Intent.weather.precip": [
            "Base": {"Set": true},
            "Criteria": "{{truthy weather.forecast.precip}}"
          "Yup, {{weather.forecast.precip}} is in the forecast.",
          "Looks like we can expect {{weather.forecast.precip}}.",

            "Base": {"Set": true},
            "Criteria": "{{falsey weather.forecast.precip}}"
          "Nope, looks {{weather.condition}} at the moment.",
          "Probably not, should be {{weather.forecast.condition}}."

        "Intent.weather.precip.quant": [
            "Criteria": "{{truthy weather.forecast.precip}}",
            "Template": {
              "Text": "Expect about {{weather.forecast.quant.in}} inches of {{weather.forecast.precip}}."
            "Criteria": "{{and (eq weather.forecast.precip 'snow') (gt weather.forecast.quant.in 12)}}",
            "Template": {
              "Text": "Be careful! Could be over {{weather.forecast.quant.in}} inches of {{weather.forecast.precip}}."

            "Criteria": "{{falsey weather.forecast.precip}}",
            "Template": {
              "Text": "We're not expecting any precipitation right now." 

The definition of responses for the "weather.basic" Intent is the most straightforward. We provide three possible responses, all of which may be used. In each, we use a weather object which we'll discuss momentarily, but which has fields that store the condition ("clear", "partly cloudy", "raining") and the temperature that we include in the result.

The other two Intents are a little more complicated, but similar to each other. For the "weather.precip" Intent, we list a number of possible responses again, but this time we will set criteria for some of those responses by setting "base" values, which will be applied to all later responses (until changed). So we will have two responses whose criteria are if the forecasted precipitation field in the weather object is set to something such as "rain" or "snow", indicating that precipitation is in the forecast, and two whose criteria are the opposite. Those responses use either the forecasted precipitation type or, in one case, the current conditions (since theres nothing in the forecast).

The "weather.precip.quant" Intent is similar, but instead of setting base criteria, we include the criteria directly in the response itself. So when there is precipitation expected, we have an additional possible response if there are more than 12 inches of snow expected.

But how is the weather object itself populated? Through an Action handler function we write and then register when we setup our webhook

function handleWeather( env ){
  const location = env.location;  // Assume we build this somewhere
  const date     = env.date;      // Assume we build the date too
  return getWeather( location, date )
    .then( data => {
      const weather = {
        // Build the weather object from the data we get back
      env.weather = weather;
      return Promise.resolve( env );

Multivocal.addActionHandler( "weather", handleWeather );

I've omitted lots here, since I don't want to go into details about calling an API, but I'm going to assume we have a few things:
  • We get the location and date somehow. This could be by the user saying so, us asking for the location, assuming defaults, or something else.
  • With this location and date, we can call getWeather(), which uses an API of some sort to get the data. Since this is probably asynchronous, we'll return a Promise that resolves to the data with all the weather information.
  • From this data, we have everything we need to build our weather object that we use with the responses.


Separating our business logic from our responses, and having our responses be more closely tied to the Intents that the user triggers, seem like a good solution to the problem of having to make many Intents that vary only slightly from each other.

Different platforms may handle this differently. It is worth noting that Samsung's Bixby platform comes with tools that work similarly to what multivocal offers, while Amazon's ASK would require you to build something that bundles the Intents into Topics yourself. But conceptually, the idea is sound and useful for no matter what platforms you design and develop against.

My thanks to Rebecca Evanhoe for her presentation that sent me down this thought path.


Thinking for Voice: Context is Queen

When we think about designing and building for voice, we often say that we need to pay attention to the context at each stage of the conversation. There are many aspects to what "context" means, however, and it isn't always clear if something is (or needs to be) contextual or not or how contextual something should be.

Context for the comments

I ran into this sort of issue while working on an Action this week. Users can work on different kinds of files, and asking "what can I do" depends on which type of file they're using. Making this help contextual was obvious and straightforward - just provide commands that make sense to this file.

But is that all of the context that we need to be aware of? Possibly not. Some of the users may not have linked their account, so only have limited access to some files. Other users may not be permitted to write to the file, just to read from it. Some features may be premium features.

If we give help, do we tell users about everything that is possible? Or just what they have access to?

In a visual world, we could display a large generic help page, covering all possible options available at that time, indicating which ones are restricted or require additional permissions. We might even use this to upsell the premium version while comparing the features available between the various versions.

We don't have that kind of luxury with voice, however.

With voice, we need to deliver our message succinctly, providing what the user needs at that moment,  but no more than they want, while also making them aware of what else they can ask for if they wish. It is a delicate balance - too much information can be overwhelming, and they don't really get all of it. Too little, and they will get frustrated that they keep having to ask for more information.

While discussing my dilemma with others, they helped me realize that the message I had been using was giving people the impression that they could do more than they could at that moment. While I had been working with concepts borrowed from "greyed out" menu items that I was helping them understand what they could do in the Action.

The difference is important, and making sure we understand what the user expects when they ask the question is important as well.

(One question raised is how often users actually do something like this. It is a good question, and one I'll have to delve into another time.)

Crafting Context

So how did I fix my problem? Using the multivocal library, responses are just templates and can be keyed to an Intent, so it was easy to use a different template in each scenario. Multivocal's response configuration can also specify under what criteria the response in valid by evaluating environment settings using handlebars.

Since I was already setting an isDemo environment setting in the existing User setting, I leveraged this. I made one response valid if the isDemo environment setting was true, and the other if isDemo was false. There are also references to configurations for suggestion chips and a card that links to the website - both of these are used in other responses, so they're just included here.

It looks something like this

  Local: {
    und: {
      Response: {
        "Intent.filetype.help": [
          {Base: {Ref: "Config/Local/und/linkCard"}},
          {Base: {Ref: "Config/Local/und/suggestions"}},

            Criteria: "{{not User.isDemo}}",
            Template: {
              Text: enHelpText

            Criteria: "{{User.isDemo}}",
            Template: {
              Text: enHelpDemoText

The enHelpText and enHelpDemoText are just JavaScript constants that contain the text for each scenario.

const enHelpText = `
  You currently have a file open. 
  You can say things like:
   "Tell me about this file",
   "Add a record",
   "Set a value", or
   "Get a value". 
  Visit example.com for more examples and help.

const enHelpDemoText = `
  You currently have a sample file open, 
  so your commands are more limited.
  You can say things like
   "Tell me about this file", or
   "Get a value".
  Visit example.com for details and further help.

In Conclusion

When thinking for voice, we have to keep context in mind for everything we say to people talking to our assistants. Not just the context of what their question was, but how skilled they are, what permission they have, and dozens of other contextual factors that we may have. Above all, we need to keep this context in mind to make sure our replies guide them towards the information they want, understanding what state they're in, what else they can do, and how they may be able to learn more if necessary.

This post was based on a tweet I sent out, pondering the issue. My thanks to Cathy Pearl, Rebecca E, Jeremy Wilken, and Siddharth Shukla, for their input and discussion on the issue. Do you have thoughts on this? I'd love to hear them on the tweet above!