Wednesday, January 3, 2018

Isabella: Spotify and OAuth

Previously on Dr. Lambda's blog:

In a previous post I presented my newest pet project: Isabella. Isabella is a voice controlled personal assistant (VCPA), like Siri, Alexa, and others. We have decided to investigate how difficult it is to make such a program. In the last post we finally deployed her to the cloud.

Now, the continuation...

Voice Controlled Personal Assistants

As Isabella has grown, I have started to grow more and more dependent on her, and indeed more attached to her. In the beginning this was just a fun experiment to see how difficult it was to make something like Alexa. At the same time I was strongly considering buying a "real" VCPA like Amazon Echo or Google Home. That doesn't seem reasonable anymore. The other VCPAs do offer a few features that Isabella doesn't have... yet. To balance it out I have decided to add a feature to Isabella that isn't available in the other assistants.

Spotify

I listen to music quite a lot. Whether I'm working or cooking, Spotify is usually playing in the background. Again, I don't want to get into a discussion about which music streaming service is best by any measure; I just happen to use Spotify. Unfortunately, playing music from Spotify is not supported by Amazon Echo in my country, at the time of writing. Of course this is due to politics and not technology. However, I still want it.

Research

Spotify has great documentation for their web API. My first idea was just to get some audio stream, pipe it into an audio-tag and boom, music from Spotify. Unfortunately this turned out to be impossible: you can only retrieve a 30-second clip of a song.
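To make the dead end concrete, here is a minimal sketch of asking the search endpoint for a track. The function name findPreviewUrl and the fetch-based setup are my own, and the accessToken is the one obtained through the OAuth flow described later in this post; the best the API hands back is the track's preview_url, which is the 30-second clip.

```typescript
// Minimal sketch: search for a track and return its preview clip URL.
// Assumes an accessToken obtained through the OAuth flow described later in this post.
async function findPreviewUrl(accessToken: string, query: string): Promise<string | null> {
  const response = await fetch(
    `https://api.spotify.com/v1/search?q=${encodeURIComponent(query)}&type=track&limit=1`,
    { headers: { Authorization: `Bearer ${accessToken}` } },
  );
  const data = await response.json();
  const track = data.tracks?.items?.[0];
  // preview_url is only a ~30 second MP3 clip, and may even be null for some tracks.
  return track?.preview_url ?? null;
}
```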

This was quite the roadblock, and it stumped me for several days. I looked over the API again and again, and it just seemed to have methods for searching and for "clicking" the different buttons in the interface. In a way, the commands in the API add up to a remote control. Then it hit me: a remote control was exactly what I was trying to build. I didn't want to build an entire music streaming platform; I just wanted to control one.

This does have the limitation that Spotify needs to be constantly running in the background. But it also means that Isabella can control Spotify playing on other devices, like phones or tablets.
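As a sketch of what that remote control looks like in code (the helper names are mine, and the accessToken with the user-modify-playback-state scope comes from the OAuth flow in the next section), pausing and resuming playback are just PUT requests against the player endpoints:

```typescript
// Sketch of the "remote control" idea: pause or resume whatever device is currently playing.
// Assumes an accessToken with the user-modify-playback-state scope (see the OAuth section below).
async function pausePlayback(accessToken: string): Promise<void> {
  await fetch("https://api.spotify.com/v1/me/player/pause", {
    method: "PUT",
    headers: { Authorization: `Bearer ${accessToken}` },
  });
}

async function resumePlayback(accessToken: string): Promise<void> {
  await fetch("https://api.spotify.com/v1/me/player/play", {
    method: "PUT",
    headers: { Authorization: `Bearer ${accessToken}` },
  });
}
```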

OAuth

The first step when working with the Spotify API is to implement their OAuth protocol. Luckily, Spotify's OAuth is super easy to implement thanks to their documentation. Most people know OAuth only from "log in with Facebook" (or Google), but it can do much more. I imagine that we will use this same protocol for many APIs that we add in the future, like calendars, email, etc. Therefore I'll briefly explain the basics of OAuth. In my experience OAuth is difficult to grasp at first sight, so you should not expect to gain a deep understanding from this presentation.

Because repetition is good for understanding, I'll explain it using two metaphors I like. Then I'll explain it with the technical terms, because repetition is good for understanding.

Imagine that we are managers in a warehouse. We have access to many areas, some of which are restricted, meaning only we have access to them. Now, for some reason, we want someone else to solve one of our tasks. But in order to solve this task they need access to some of the restricted areas that we have access to. This is the fundamental problem that OAuth solves.

The protocol states that:

  • you ask the person who should perform the task.
  • the person asks the secretary for a key to the restricted area.
  • the secretary calls you to ask if this person is allowed into this particular restricted area.
  • you confirm.
  • the secretary writes an official form and gives it to the person.
  • the person takes the form to the janitor.
  • the janitor makes the necessary key and gives it to the person.

At this point the person can perform the task. We could imagine the same procedure if you are applying for a job, and the company wants to know your grades, which are usually secret. The protocol states that:

  • you send a job application.
  • the company asks your school (or university) for your grades.
  • the school calls you to ask if this company is allowed to see your grades.
  • you confirm.
  • the school sends a link to the company.
  • the company opens this link in a browser.
  • the browser shows the company your grades.

Finally let's take the concrete example of Isabella and Spotify. The protocol states that:

  • you want Isabella to control Spotify, so you send a request to Isabella.
  • Isabella redirects this request to Spotify, adding some authentication information, so Spotify knows who "Isabella" is.
  • Spotify then presents you with a "this application wants access to these areas" screen.
  • you click confirm/continue, i.e. you send a request to Spotify.
  • Spotify redirects this request to Isabella, adding a special token.
  • Using this token Isabella sends a request to Spotify.
  • Spotify returns an access_token.

Basically every call in Spotify's API requires this access_token.
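Here is a minimal sketch of the two requests behind that list, roughly as Spotify's documentation describes them. The helper names and the clientId/clientSecret/redirectUri parameters are my own placeholders; the real values come from registering Isabella as a Spotify application.

```typescript
// Step 2 of the list above: build the URL the user is redirected to, so Spotify can ask for consent.
function buildAuthorizeUrl(clientId: string, redirectUri: string, scopes: string[]): string {
  const params = new URLSearchParams({
    client_id: clientId,
    response_type: "code",
    redirect_uri: redirectUri,
    scope: scopes.join(" "),
  });
  return `https://accounts.spotify.com/authorize?${params}`;
}

// Steps 6-7: using the special token (the code), Isabella asks Spotify for an access_token.
// Assumes a Node-style backend (Buffer) so the client secret never reaches the browser.
async function exchangeCodeForToken(
  clientId: string,
  clientSecret: string,
  redirectUri: string,
  code: string,
): Promise<string> {
  const response = await fetch("https://accounts.spotify.com/api/token", {
    method: "POST",
    headers: {
      Authorization: "Basic " + Buffer.from(`${clientId}:${clientSecret}`).toString("base64"),
      "Content-Type": "application/x-www-form-urlencoded",
    },
    body: new URLSearchParams({
      grant_type: "authorization_code",
      code,
      redirect_uri: redirectUri,
    }),
  });
  const data = await response.json();
  return data.access_token;
}
```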

The first step

The first step in the protocol is to show that you want Isabella to take control of Spotify. The standard way is to have a button, and that was my first approach too. The first time you click it, it takes you away from Isabella, and you are confronted with a consent screen. From a Human-Computer Interaction viewpoint, this view change is feedback, so it is fitting to have a button. However, any subsequent time you click it, Spotify remembers your consent and just sends you straight back to Isabella without you noticing it. This means that in the subsequent cases we have a button without noticeable feedback – not good.

Common computer science knowledge teaches us that we should optimize for the common case. Imagine that you want Isabella to take control of Spotify often... very often. We only log in "for the first time" once. The common case is clearly the subsequent times, where it does not make sense to have a button. Therefore I decided in the end to remove the button and add a command to "log in to Spotify".
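Purely as a hypothetical sketch of that command (this is not Isabella's real command matching; buildAuthorizeUrl is the helper sketched in the OAuth section above, and the CLIENT_ID/REDIRECT_URI values are placeholders), the spoken phrase simply triggers the same redirect the button used to:

```typescript
// Hypothetical wiring of the spoken "log in to Spotify" command to the authorize redirect.
const CLIENT_ID = "<isabella-client-id>";            // placeholder
const REDIRECT_URI = "https://example.org/callback"; // placeholder

function handleSpokenCommand(transcript: string): void {
  if (/log in to spotify/i.test(transcript)) {
    // Same redirect the button triggered; repeat visits bounce straight back to Isabella.
    window.location.href = buildAuthorizeUrl(CLIENT_ID, REDIRECT_URI, [
      "user-read-playback-state",
      "user-modify-playback-state",
    ]);
  }
}
```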

Now this does cause a problem with discovery: the process by which a user learns about features. It is easy to see a button and try to click it. It is harder if there are no visual cues. However, this is a general problem for VCPAs: how do you know what you can do with one? With human interaction we assume that either the receiver knows how to answer our query, or we can teach them. Is this an approach we can take with VCPAs? Start with a broad basis of tasks, and then have the users teach them what they need? How should they teach it? I will certainly look deeper into this in a later post.

For now, here is a video:
