How do we do this for tasks we have spent years building visual tools around?
We need to think about the semantics of what we're talking about - what something means, and what it is made of. Not just the syntax - a set of rules about how to arrange something. Finding the semantics helps us understand and describe the bigger picture.
Doing so sometimes requires us to step back and think. Here are a few examples.
Smart Home control
Perhaps the most straightforward way to illustrate this is by looking at how we use our smart home controls now. Phrases like "turn on the lights", "turn off the living room tv", or "is the sprinkler running" are the very direct, syntactic commands we would expect.
But users also expect that saying something like "good night" will turn off the lights and make sure the house alarm is enabled. Saying "I'm home" might turn on the lights in the entryway. Other, similar commands don't say specifically what the lights should do - but they do provide the context that we expect "something" should happen.
Systems that only understand us syntactically can respond only to direct commands, and they become frustrating when we have to issue several commands just to turn on or adjust a few lights. Systems that also understand the semantics of our statements, for example what we might expect when we say "good night", are at the heart of where voice needs to go to become truly powerful, and not just for appliance control.
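As a rough sketch, here's what that semantic layer might look like in Python: one phrase fans out into several device actions. The device names and the SCENES table here are invented for illustration; real smart home platforms expose similar "scene" or "routine" primitives.

```python
# A minimal sketch of a semantic "scene" layer over syntactic device
# commands. Device names and the SCENES table are hypothetical.

# Syntactic layer: one utterance maps to one device action.
def turn_on(device: str) -> None:
    print(f"{device}: on")

def turn_off(device: str) -> None:
    print(f"{device}: off")

def arm(system: str) -> None:
    print(f"{system}: armed")

# Semantic layer: one utterance maps to an *intent*, which fans out
# into however many device actions the context calls for.
SCENES = {
    "good night": [
        lambda: turn_off("living room lights"),
        lambda: turn_off("living room tv"),
        lambda: arm("house alarm"),
    ],
    "i'm home": [
        lambda: turn_on("entryway lights"),
    ],
}

def handle_utterance(utterance: str) -> None:
    actions = SCENES.get(utterance.lower())
    if actions is None:
        print(f"No scene for: {utterance!r}")
        return
    for action in actions:
        action()

handle_utterance("Good night")  # lights off, tv off, alarm armed
```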
Games
Many games seem like they wouldn't work well with voice, at least given current levels of technology, but quite a few have found a good place in the voice world. Games that emulate TV game shows or bar games, for example, are well suited to it, as we can see from how voice platforms use them today. Text adventure games and escape room games, which can be played at the user's pace, also work well.
For others, it may be more of a challenge. It is difficult, for example, to imagine controlling a character that moves in real time when our voice processing takes a second or two per utterance. Many of our action games expect hair-trigger responses, and our voices aren't up to the task. It seems quite absurd to imagine playing a classic video game by shouting "left! left!" to move a character.
But all of these are syntactically driven - issuing specific commands. What if we shifted toward giving instructions, perhaps even guidelines, stating what our intent or objective is rather than exactly how to achieve it? That becomes a semantically driven game.
How might that work? It depends on the game.
What would a solitaire card game look and sound like if it were voice driven? A raw translation of a current game might produce a conversation like this:
Game: "What row to select card from?"
Human: "Third row"
Game: "You can only move the bottom card. Where do you want to move it to?"
Human: "Fourth row"
Game: "Ok. You can pick a card or deal more."
and so forth. This sounds very stilted because it is strictly command driven. What if it were more semantically driven?
Game: "You have a few moves, or you can deal more cards. What do you want to do?"
Human: "Move the five to the six."
Game: "There you go. What next?"
Human: "Deal"
Once we start thinking about games in terms of semantics, the entire notion of how we play some of them changes. Current war games, for example, require "generals" to make minute decisions about troops and issue those commands. What if the game started you as a low-level officer who needs to issue fairly tactical commands, and you got promoted higher and higher, having to deal with issues on an increasingly strategic level? This can be done today, but it lends itself better to voice control because it is more like the real world the game seeks to emulate.
Spreadsheets
I would agree, it is weird, and probably a lousy Voice User Interface (VUI). It is how our modern spreadsheets are currently built, but let's take a step back and think about what spreadsheets actually represent and how they're actually used today. They're a way to organize data in a structured form, and a few layouts are far more common than others. It is pretty common, for example, to look at a spreadsheet and see column headers along the top, each row representing some chunk of data, with the most identifying part of that data to the left and details about it further to the right.
This structure provides the semantics of the spreadsheet. Rows become individual records - we tend to work on things a row at a time. Columns are fields in that record that we can identify by name. Sometimes we may want to look something up by one value, and find out other values in that row. As a spreadsheet, we may enter some values, and compute other values in a row.
Once we begin to understand the semantics, we no longer need to issue commands about how to move around the spreadsheet - that would be thinking of the spreadsheet syntactically. Instead, we can carry on a conversation surrounding the semantics. We can ask it to add a record and be prompted for what values to fill in. We can locate some information. We can ask "what-if" questions.
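As a hedged illustration, here's a small Python sketch of that record-oriented view. The data, field names, and helper functions are all invented; the point is that questions address fields by name rather than cells by address.

```python
# A minimal sketch of treating a sheet semantically: rows are records,
# columns are named fields, and questions are asked by field name
# rather than by cell address. The data and field names are invented.
sheet = [
    {"item": "widgets", "price": 2.50, "quantity": 40},
    {"item": "gadgets", "price": 7.25, "quantity": 12},
]

def lookup(field: str, value, want: str):
    """'What is the price of gadgets?' rather than 'What is B3?'"""
    for record in sheet:
        if record[field] == value:
            return record[want]
    return None

def what_if_price(item: str, new_price: float) -> float:
    """'What if gadgets cost 8.00?' - recompute the total,
    leaving the sheet itself untouched."""
    total = 0.0
    for record in sheet:
        price = new_price if record["item"] == item else record["price"]
        total += price * record["quantity"]
    return total

print(lookup("item", "gadgets", "price"))   # 7.25
print(what_if_price("gadgets", 8.00))       # 196.0 instead of 187.0
```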
This is, in fact, exactly how spreadsheets get used. Someone asks a question, either of themselves or of someone else, and to find the answer, they organize the data in a spreadsheet and run some calculations through the sheet to figure out the result. The question is phrased semantically, and the answer is given the same way.
Computer Programming
Maybe. Programs aren't really "real world" constructions, and they don't have the same constraints that we might find in games or spreadsheets. They are much more free-form. (If they weren't, we wouldn't have so many different kinds.)
Which is why it makes sense that the initial rounds of programming by voice focus on syntactic accessibility systems that allow programmers to keep working largely the way they did before. Tools like Talon provide ways to create "voice shortcuts" to move around a code base, add code (including whatever flavor of syntactic sugar the language requires), and perform other tasks using the language and framework they're used to.
This is important, to be certain. But can we do better? Can we look at how we think about and write software now and possibly turn that into a set of voice-directed tasks?
Again, maybe.
This is certainly more challenging than providing a way to update a spreadsheet or control our lights. We need to take a big step back and think about... well... think about how we think when we write programs.
One common way to do so is to start with a big task and break it into logical, smaller parts. Let's look at a very simple example:
- Ask for a number, which we will call N
- Print something based on N
We might break that second one down into something like:
- If N is odd, the character (C) to print will be a star, otherwise C will be a plus sign
- Print the character N times
This is simple, but already talking about it seems somewhat complex. And yet, it illustrates how we can narrate a few basic components of programming - input and output, breaking tasks down into smaller functions, conditionals, and loops. If we said it this way, and with guidance, could we actually turn that into a real, functional program? Would it scale to a large program with hundreds of thousands of lines?
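For what it's worth, here is one plausible rendering of that narrated breakdown into Python, with each spoken step becoming its own small function. The function names are my own invention; a voice-directed system would presumably need to negotiate such names with us.

```python
# One plausible rendering of the narrated breakdown above into real,
# functional code; the structure follows the spoken steps one-to-one.

def ask_for_number() -> int:
    """Ask for a number, which we will call N."""
    return int(input("Give me a number: "))

def pick_character(n: int) -> str:
    """If N is odd, the character is a star, otherwise a plus sign."""
    return "*" if n % 2 == 1 else "+"

def print_based_on(n: int) -> None:
    """Print something based on N: the character, N times."""
    c = pick_character(n)
    print(c * n)

n = ask_for_number()
print_based_on(n)  # e.g. entering 5 prints "*****"
```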
Can we leverage existing programming languages to help us do this? Perhaps building off a more visual programming language, rather than a text-based one, would be the better approach - controlling blocks in something like Scratch instead of trying to control lines in more traditional languages. Or do we need to start from a tabula rasa?
Honestly - I don't know. (I didn't know if it could be done with spreadsheets, but that is progressing nicely.) I do know that I'm thinking about all this, and starting to work on it, and I'd like to think, talk, and work with others who see this as a reasonable problem to tackle.
Conclusions
Voice is a powerful new tool, but most of our work with it to date has focused on the syntactic approach. To truly leverage the power it will bring us, to truly make for a conversational experience, we need to rethink our approach. We need to understand the semantics of the questions and commands coming from our users.
To date, people have been telling their virtual assistants how to do something. Voice becomes incredible, however, when people are able to express what they want, instead.