Recently I tried the Amazon Echo for the first time. I had always thought voice control was annoying and unnecessary. However, Alexa stole my heart. I don't want to get into a discussion about which voice control is better; I don't care.
What I do care about, however, is this: given the current state of the field, how difficult is it to make a voice-controlled personal assistant? Here I'm thinking of all the libraries and APIs that are freely available.
So I have set out to get a sense of the answer. Let's follow one of Google's sayings:
- First do it,
- Then do it right,
- Then do it better.
First do it...
To start something new like this, we always begin with an example and a tiny prototype. During this spike we specifically try to get as close as possible to the thing we think will be hardest in the project.
First things first
First we needed to find a decent speech recognition framework for TypeScript. We quickly found Annyang and started playing with it. We want some context awareness in our PA, so we cannot use Annyang's standard command recognition. Instead we used Annyang to capture everything that was said and then built our own algorithm to match the meaning to a command.
We were also fortunate that HTML5 already has support for text-to-speech, so we just use that.
Calculations
Having it recognize (and answer) "hello" took all of two minutes. So we started thinking: what do we actually want it to do? I know, this should have been our first question, but we were blinded by the idea of building Jarvis and becoming real-life Tony Starks.
Then it hit us. We want it to build Jarvis. Or, put more simply, we want it to help us code. We want to be able to talk to it like we do to a colleague, and then it should program what we tell it to. Let me just clarify: we don't want to build an AI, just an assistant that we can tell "do a linear search, then sort the list, and return the median" or something.
First we wanted it to do simple calculations like 2+2. The easiest way to do this was to just eval what was said. It worked like a charm. Only 15 minutes in, we could already access the values of variables and add numbers.
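To make the idea concrete, here is a minimal sketch of evaluating a transcript with eval. The operator table and the name evaluateSpoken are assumptions made for illustration; a speech recognizer may well hand back "2 + 2" directly, in which case the replacement step does nothing.

```typescript
// Hypothetical table mapping spoken operator words to symbols (not from the original spike).
const SPOKEN_OPERATORS: Array<[RegExp, string]> = [
  [/\bplus\b/g, "+"],
  [/\bminus\b/g, "-"],
  [/\btimes\b/g, "*"],
  [/\bdivided by\b/g, "/"],
];

// Evaluate an expression, returning undefined instead of throwing on bad input.
function tryEval(expression: string): unknown {
  try {
    // eval runs arbitrary code, which is tolerable in a throwaway spike only.
    return eval(expression);
  } catch {
    return undefined;
  }
}

// Replace spoken operators with symbols, then evaluate the result.
function evaluateSpoken(text: string): unknown {
  let expression = text.toLowerCase();
  for (const [pattern, symbol] of SPOKEN_OPERATORS) {
    expression = expression.replace(pattern, symbol);
  }
  return tryEval(expression);
}

console.log(evaluateSpoken("2 plus 2")); // 4
```

Anything that is not a valid expression simply comes back as undefined, so the caller can fall through to other kinds of commands.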
Function calls
Function calls were trickier, especially because we didn't want to say "open parenthesis" or the like. We have seen the YouTube video, and we don't tell our colleagues where to put parens or commas, after all.
We made a function that takes progressively shorter prefixes of the input, camel-cases them, and tests whether they are function names. Here is some pseudocode to show the idea:
tryEval(exp: string)
  try { return eval(exp); }
  catch(e) { return undefined; }

matchFunction(words: string[])
  for i : words.length ... 0
    prefix <- words.take(i)
    identifier <- makeId(prefix)
    evalResult <- tryEval(identifier);
    if(typeof evalResult === "function")
      return identifier;
Notice that even though we use eval, this code does not actually call the function; it just finds the name.
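For readers who want something runnable, here is a rough TypeScript sketch of the same prefix search. It swaps eval for a lookup in a plain object (the scope parameter), which is an assumption made to keep the example self-contained; makeId is written out here as one guess at how the camel-casing could work.

```typescript
// Hypothetical helper: ["linear", "search"] becomes "linearSearch".
function makeId(words: string[]): string {
  return words
    .map((word, index) => {
      const lower = word.toLowerCase();
      return index === 0 ? lower : lower.charAt(0).toUpperCase() + lower.slice(1);
    })
    .join("");
}

// Try the longest prefix first, then progressively shorter ones.
// Returns the name of the first prefix that is a function in `scope`.
function matchFunction(words: string[], scope: Record<string, unknown>): string | undefined {
  for (let i = words.length; i > 0; i--) {
    const identifier = makeId(words.slice(0, i));
    // Own properties only, so inherited names like "toString" do not match.
    const isOwn = Object.prototype.hasOwnProperty.call(scope, identifier);
    if (isOwn && typeof scope[identifier] === "function") {
      return identifier;
    }
  }
  return undefined;
}

// Example scope with one function and one plain value.
const exampleScope: Record<string, unknown> = {
  linearSearch: (list: number[], target: number) => list.indexOf(target),
  answer: 42,
};

console.log(matchFunction(["linear", "search", "the", "list"], exampleScope)); // linearSearch
```

Whatever words remain after the matched prefix are then free to be interpreted as arguments.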
Success! We could tell the computer to define variables, evaluate expressions, and even call functions. This was great news for the viability of the project. As the spike ended, we fulfilled our promise (to the extreme programming gods) and erased everything.
Then do it right...
For the next phase of a project like this, we start at the complete opposite end of the spectrum, with all the lowest-hanging fruit first. If you are very nerdy you could say that we use "shortest arrival time scheduling". We also make sure to make good decisions, as this is potentially long-lasting code.
Matching meaning
This time we needed a more solid "meaning" algorithm. We do have a great advantage over general AI: we only need to match the input to a command from a very small list. With this in mind we decided to flip the problem on its head: we have a list of possible results, so which one is closest to the input? The code went something like this:
foreach command : database
  match <- 0
  foreach cWord : command
    best <- cWord.length
    foreach iWord : input
      if(best > distance(cWord, iWord))
        best <- distance(cWord, iWord)
    match <- match - best;
Intuitively: for each word in each command, look for a word in the input that matches; the command with the most matches wins. So we are matching the command against the input, not the other way around.
Of course we also wrote some normalization code to remove contractions and such, but that is pretty straightforward.
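The pseudocode above can be sketched in TypeScript roughly as follows. The post does not say which distance function was used, so this assumes Levenshtein edit distance; the names score and bestCommand and the final "pick the highest score" step are also illustrative additions.

```typescript
// Assumption: "distance" is Levenshtein edit distance between two words.
function distance(a: string, b: string): number {
  let previous: number[] = [];
  for (let j = 0; j <= b.length; j++) previous.push(j);
  for (let i = 1; i <= a.length; i++) {
    const current: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      current.push(Math.min(previous[j] + 1, current[j - 1] + 1, previous[j - 1] + cost));
    }
    previous = current;
  }
  return previous[b.length];
}

// Score one command against the input: 0 is a perfect match, lower is worse.
function score(commandWords: string[], inputWords: string[]): number {
  let match = 0;
  for (const cWord of commandWords) {
    let best = cWord.length; // starting penalty when no input word is closer
    for (const iWord of inputWords) {
      best = Math.min(best, distance(cWord, iWord));
    }
    match -= best;
  }
  return match;
}

// Split on whitespace, dropping empty pieces.
function toWords(text: string): string[] {
  return text.toLowerCase().split(/\s+/).filter((w) => w.length > 0);
}

// Return the command in the database with the highest score.
function bestCommand(database: string[], input: string): string | undefined {
  const inputWords = toWords(input);
  let winner: string | undefined;
  let winnerScore = -Infinity;
  for (const command of database) {
    const s = score(toWords(command), inputWords);
    if (s > winnerScore) {
      winnerScore = s;
      winner = command;
    }
  }
  return winner;
}
```

Because the loop walks the command's words rather than the input's, extra filler in the input ("please", "tell me") costs nothing, which is the point of matching the command against the input.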
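As an illustration of what such normalization might look like, here is a small sketch. The contraction table is a made-up sample, not the list we actually used, and normalize is a hypothetical name.

```typescript
// Hypothetical sample of contractions to expand; a real table would be longer.
const CONTRACTIONS: Record<string, string> = {
  "what's": "what is",
  "don't": "do not",
  "i'm": "i am",
  "it's": "it is",
  "can't": "can not",
};

// Lowercase the input, strip punctuation, and expand known contractions.
function normalize(input: string): string[] {
  const words = input
    .toLowerCase()
    .replace(/\u2019/g, "'") // treat a curly apostrophe like a straight one
    .split(/\s+/)
    .map((w) => w.replace(/[^a-z0-9']/g, ""))
    .filter((w) => w.length > 0);

  const expanded: string[] = [];
  for (const w of words) {
    const known = Object.prototype.hasOwnProperty.call(CONTRACTIONS, w);
    const replacement = known ? CONTRACTIONS[w] : w;
    expanded.push(...replacement.split(" "));
  }
  return expanded;
}

console.log(normalize("What's your name?")); // [ 'what', 'is', 'your', 'name' ]
```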
Isabella say "hello"
There is a hidden assumption in the matching algorithm: everything it hears is a command. This is not always the case, so we need some way to know that you are talking to it and not just talking. The way we solve that in the real world is with names, so let's use the same solution here. We needed a name distinct enough that we wouldn't say it in normal conversation, and it shouldn't sound like other words.
For now we have settled on "Isabella", as it is a beautiful name that no one in our social circle has.
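A minimal sketch of the name check could look like this. The function name extractCommand is hypothetical, and treating everything after the name as the command is an assumption about how the filtering might work.

```typescript
// Return the words spoken after the assistant's name, or undefined if the
// name was never said (meaning the utterance was not addressed to it).
function extractCommand(utterance: string, name: string = "isabella"): string | undefined {
  const words = utterance
    .toLowerCase()
    .split(/\s+/)
    .map((w) => w.replace(/[^a-z0-9']/g, "")) // drop punctuation such as "Isabella,"
    .filter((w) => w.length > 0);

  const index = words.indexOf(name.toLowerCase());
  if (index === -1) {
    return undefined;
  }
  return words.slice(index + 1).join(" ");
}

console.log(extractCommand("Isabella, say hello")); // say hello
console.log(extractCommand("we should say hello")); // undefined
```

Only utterances that produce a command would then be passed on to the matching step.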
Code written in this phase has to be a lot more maintainable, so we have a tiny database of inputs and answers. It is trivial to add constant things like "what is your name?" "Isabella", but that isn't very fun. Therefore we built in support for "hooks" ($), where we can specify a function to call instead of just saying the string out loud.
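To give a feel for how such a database with hooks might work, here is a sketch. Everything about the format is an assumption for illustration: it treats an answer that starts with "$" as the name of a function in a hooks table, and the names respond, exampleAnswers, and exampleHooks are invented for this example.

```typescript
type Hook = () => string;

// Look up the answer for a matched command. Plain strings are returned as-is;
// answers starting with "$" are treated as hook names (an assumed convention).
function respond(
  command: string,
  answers: Record<string, string>,
  hooks: Record<string, Hook>
): string | undefined {
  if (!Object.prototype.hasOwnProperty.call(answers, command)) {
    return undefined;
  }
  const answer = answers[command];
  if (answer.charAt(0) !== "$") {
    return answer;
  }
  const hookName = answer.slice(1);
  if (!Object.prototype.hasOwnProperty.call(hooks, hookName)) {
    return undefined; // the database refers to a hook that does not exist
  }
  return hooks[hookName]();
}

// Hypothetical database: one constant answer and one hook.
const exampleAnswers: Record<string, string> = {
  "what is your name": "Isabella",
  "what time is it": "$currentTime",
};

const exampleHooks: Record<string, Hook> = {
  currentTime: () => new Date().toLocaleTimeString(),
};

console.log(respond("what is your name", exampleAnswers, exampleHooks)); // Isabella
```

The string that comes back is what gets handed to text-to-speech, so a hook only has to return the sentence it wants spoken.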
I think that's enough for one day, time to go to bed!