
    Wednesday, November 2, 2011

    Some Comments on #Siri

    I have been using Apple's Siri quite a bit over the past while. It's extremely useful. In this entry I want to talk about the somewhat unique view people working in my field have of it, some things Siri does well, and some things Apple needs to add in subsequent updates.

    My View
    For those who don't know, I work in telecommunications - specifically on Voice over IP technologies (think telephone over the Internet, like Skype or Rogers Home Phone), and even more specifically on systems that let humans interact with automated systems using the dial pad (DTMF tones) and, yes, your voice.

    Voice interaction systems use something called automatic speech recognition (ASR) to figure out what exactly it is you're saying. There used to be - and depending on who you ask, there still are - a number of competitors in this field, including IBM, but Nuance is the dominant company in the market today.

    It's pretty well understood that Apple's Siri uses licensed Nuance technology on the back end to perform the actual speech recognition. But having worked with Nuance for several years now, I can say that Nuance on its own isn't as accurate as you might think, and it never gives me results as accurate as Siri seems to provide. Siri itself must be performing some magic on top. To explain, I'll quickly outline, in point form, how I think Siri works:

    • Siri prompts you for some information. This is sometimes done with a question, and always with the familiar Siri "ding".
    • When you stop speaking (Siri does some amplitude/volume detection to determine this) or you push the button to indicate that you have stopped speaking, Siri packs up the audio it has recorded from you and sends it out over the 'net to Apple's servers.
    • The first part of Siri's magic happens here: Siri provides Nuance with the audio capture, along with a bunch of intelligently constructed "grammars" for Nuance to work with to figure out what you want. It's also possible that Siri just provides Nuance with a bunch of context-specific dictionary words, since it's quite good at figuring out what you're saying, and also pretty good at using your information to improve things (without fail, it spells "Katharine" in my Katharine's unique way of spelling it).
    • The next part of the magic is what Siri does after Nuance has processed your recording and returned results with confidence scores (that is, how likely it is you said "bore" vs. "boar"): Siri figures out from this information the most likely thing you tried to say. It also parses the result and does some magic to interpret what you were trying to ask it to do (see the sketch after this list).
    • Siri then feeds the results back to your phone, asks you for additional information, rinse, lather, repeat.
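
    To make that re-scoring idea concrete, here's a minimal sketch in Python of how an assistant layer might pick from a recognizer's n-best results using context like the user's contact list. The data structures and the 0.3 bonus are made up for illustration - this is not Nuance's actual API, just the shape of the idea:

        # Hypothetical n-best output from a recognizer: each hypothesis
        # carries the recognized text and a confidence score.
        nbest = [
            {"text": "send a message to boar", "confidence": 0.41},
            {"text": "send a message to Katharine", "confidence": 0.38},
            {"text": "send a message to bore", "confidence": 0.21},
        ]

        # Context the assistant layer can bring to bear: names pulled
        # from the user's address book, spelled the user's way.
        contact_names = {"Katharine", "Paul"}

        def rescore(hypothesis):
            """Boost hypotheses that mention a known contact name."""
            mentions_contact = any(name.lower() in hypothesis["text"].lower()
                                   for name in contact_names)
            return hypothesis["confidence"] + (0.3 if mentions_contact else 0.0)

        best = max(nbest, key=rescore)
        print(best["text"])  # -> "send a message to Katharine"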

    For those wondering WHY Siri needs to be connected to the 'net, remember Watson from Jeopardy, and how much CPU power was needed there? Yes, Siri is much simpler than Watson, but add in ASR and the needed processing power increases.

    I have just explained how (I think) Siri works, and this isn't too different from how the software I work with operates. The difference is that the software I work with sends the audio you're saying IN REAL TIME to the recognition server. That has advantages - hotword recognition, for example, where you want to be able to keep talking, and only when you say a "magic word" does a certain action occur - but it doesn't work well when data is being sent over a potentially unreliable connection, as is the case with your mobile phone (hence Siri's packaging of your audio before sending it over the 'net).
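
    Here's a rough sketch of that streaming, hotword-style loop. The mic and recognizer objects and their methods are hypothetical stand-ins, not any real vendor's API; the point is just that audio goes out frame by frame instead of as one packaged blob:

        FRAME_MS = 20  # ship small frames continuously, not one big capture

        def listen_for_hotword(mic, recognizer, hotword="computer"):
            """Stream audio frames to the recognizer; fire on the magic word."""
            while True:
                frame = mic.read(FRAME_MS)        # grab 20 ms of audio
                partial = recognizer.feed(frame)  # recognizer returns partial text
                if hotword in partial.lower():
                    return True  # caller starts the real interaction here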

    That said, there's an important thing Siri needs to be able to do (which I will list again in my "Things Siri Needs To Do" section): barge in. Barge-in is when, in the middle of being played a prompt (e.g., "What would you like to say to Katharine?"), you say something to stop the prompt, because you already know what is expected next. Having to wait for the Siri tone slows me down. The phone already does amplitude detection to determine when you have stopped speaking, so why not use amplitude detection to determine when you have started speaking? Our software does amplitude detection for various operations to determine when you start and stop speaking, so it is certainly possible for Apple to do this; a rough sketch follows.
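
    To show how simple start-of-speech detection can be, here's a self-contained Python sketch that computes the RMS energy of a frame of 16-bit PCM audio and compares it to a threshold. The threshold is purely illustrative and would need tuning per device; this is the spirit of our software's amplitude detection, not Apple's implementation:

        import math
        import struct

        SPEECH_THRESHOLD = 500  # energy cutoff; tune per microphone/device

        def is_speech(frame_bytes):
            """Rough speech check on one frame of 16-bit little-endian PCM."""
            count = len(frame_bytes) // 2
            if count == 0:
                return False
            samples = struct.unpack("<%dh" % count, frame_bytes)
            rms = math.sqrt(sum(s * s for s in samples) / count)
            return rms > SPEECH_THRESHOLD

        # Barge-in idea: while a prompt is playing, keep reading mic frames;
        # the moment is_speech() fires, cut the prompt and start recording.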

    I just wanted to get my thoughts out on this, since I had been thinking about it for a while.

    Things Siri Is Awesome At
    A quick point-form list of things Siri is awesome at:

    • Reading text messages.
    • Writing text messages (especially love the punctuation, ellipsis, emoticon, and correct spelling of Katharine).
    • Setting reminders.
    • Setting alarms.
    • Finding and playing my awkwardly named playlists (Top Rated Non-Symphonic).

    Things Siri Needs To Do
    More point form stuff:

    • Support barge-in (nothing is more annoying than knowing what to say next and having to wait, or making a mistake, wanting to correct it immediately, and having to wait).
    • Support email reading. I don't know what Siri is using for text-to-speech (TTS) (possibly Nuance again), but there is a standard way to support email reading, so why isn't it done?
    • Read reminders.
    • Read appointments.
    • Read anything in the notification centre.
    • Read any alert that pops up.
    • Support email writing without having to manually enter into the email.
    • Support all sorts of email operations, like Reply All or Forward.

    Summary
    I've been finding Siri to be more useful than I thought I would, but if Apple adds the above functionality, people are going to be blown AWAY by how much more efficiently they can use Siri than they thought possible, and it will take the application over the top.

    L8r
    Paul