AI Awesomeness: The Speaker Recognition API

Published on 27 Nov 2017

Microsoft have been consistently ramping up their AI offerings over the past couple of years under the grouping of "Cognitive Services". These offer, as services, things that would previously have required a degree in Maths and a deep understanding of Python to achieve, such as image recognition, video recognition, speech synthesis, intent analysis, sentiment analysis and so much more.

It's quite incredible to have the capability to ping an endpoint with an image and very quickly get a response containing a text description of the image contents. Surely, we live in the future!

Bit biased?

Not at all! Amazon and Google have their own platforms which offer some similar functionality, such as Amazon Lex and Amazon Rekognition for intent and image analysis respectively. Comparing the offerings can be an interesting exercise, should you want to.

Speaker Recognition

One of the more recent Cognitive Services offerings is the Speaker Recognition API, which has two main features: Speaker Verification and Speaker Identification.

Speaker Verification

The first one is intended to be used as a type of login or authentication: just by speaking a key phrase to your Cognitive Services-enhanced app you can, with a degree of confidence, verify that the user is who they (literally) say they are.

Each user is required to "enrol" by repeating a known phrase three times, clearly enough for the API to have a high confidence for that voice print. You can set this phrase yourself, which could be fun...
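To make the "repeat it three times" idea concrete, here is a minimal sketch of the bookkeeping an app might keep on its own side while a user enrols. It does not call the Speaker Recognition API at all; the class name, the `normalise` helper and the exact-match rule are my own assumptions for illustration, and in a real app the service decides whether each recording is good enough.

```python
from dataclasses import dataclass

# The post says each user repeats the key phrase three times to enrol.
REQUIRED_ENROLMENTS = 3


def normalise(text: str) -> list:
    """Lower-case the text, drop punctuation and split it into words."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return kept.split()


@dataclass
class EnrolmentTracker:
    """Hypothetical client-side counter of accepted key-phrase recordings."""

    phrase: str
    accepted: int = 0

    def record_attempt(self, heard_phrase: str) -> int:
        """Count the attempt if the words match; return how many are still needed."""
        if normalise(heard_phrase) == normalise(self.phrase):
            self.accepted = min(self.accepted + 1, REQUIRED_ENROLMENTS)
        return REQUIRED_ENROLMENTS - self.accepted

    @property
    def enrolled(self) -> bool:
        """True once enough matching recordings have been accepted."""
        return self.accepted >= REQUIRED_ENROLMENTS
```

A mismatched attempt leaves the remaining count unchanged, so the user is simply asked to try again.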

Could this spell the end of the concept of the password? You know the one: the usually-just-barely-complex-enough-to-pass-validation word with a capital letter at the start and a number and an exclamation mark at the end, where you increment the number now and then and reuse it everywhere (am I close?).

Yeah, that needs to be fixed. For those of you not willing to move to something like LastPass or 1Password (and you really really should if you can), perhaps reusing a phrase everywhere that is unique to your voice print is the answer.

However, one of the demo enrolment phrases is indicative of an old nerd (like me) trolling the entire concept of using your voice as a means of authentication: "my voice is my passport, verify me" is the same voice print authentication phrase used in the 1992 movie Sneakers, which sees Robert Redford and his gang "hack" into a system using a spliced recording of someone saying those key words.

If you haven't seen this movie already, get on it! I love it. Classic retro hacking action. Plus River Phoenix is in it.

Speaker Identification

I really love the idea of this one and will be using it at meetups and conferences as soon as I can. Instead of repeating a preset phrase three times, the user speaks for around 15 seconds (you get to define this), and that audio is used to determine a voice print.

Once users are enrolled, you can send a segment of audio to the Speaker Identification API and receive a speedy response containing an enrolled user's id and a confidence level. This won't give you the high level of confidence you might get from the verification endpoint, but it will give a handy "I think this person is talking" response.

Imagine having a selection of panellists at a conference who have been enrolled, then as they're speaking that panellist's picture, name, and bio appear behind them on the screen automatically. Cool stuff.
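The panel scenario can be sketched in a few lines of Python. This is only an illustration of the lookup step: the response field names (`identifiedProfileId`, `confidence`), the "Low" / "Normal" / "High" labels and the `Panellist` type are assumptions on my part, so check them against the API documentation before relying on them. No network call is made here.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Assumed confidence labels, ordered from weakest to strongest.
CONFIDENCE_ORDER = ["Low", "Normal", "High"]


@dataclass(frozen=True)
class Panellist:
    """What we want to put on the screen behind the speaker."""

    name: str
    bio: str
    picture: str


def panellist_to_show(
    result: dict,
    panel: Dict[str, Panellist],
    minimum: str = "Normal",
) -> Optional[Panellist]:
    """Return the enrolled panellist to display, or None if we are not sure enough."""
    confidence = result.get("confidence")
    if confidence not in CONFIDENCE_ORDER:
        return None  # missing or unrecognised confidence label
    if CONFIDENCE_ORDER.index(confidence) < CONFIDENCE_ORDER.index(minimum):
        return None  # too weak a match to put a name on the screen
    # An id that was never enrolled for this panel also yields None.
    return panel.get(result.get("identifiedProfileId"))
```

Returning `None` for a weak match means the screen simply stays as it is, which seems a safer default than flashing up the wrong person's bio.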

Go try them out

Pop on over to the Speaker Recognition API page to have a play for yourself. Is this the future of passwords? What do you think?
