With banks’ obsession with stopping their employees from talking to their customers, and with the apparent ubiquity of Apple’s Siri and Amazon’s Alexa, you might be forgiven for thinking we had cracked the problem of getting computers to understand human speech – commonly called automatic speech recognition (ASR) or, incorrectly, “voice” transcription. Forgiven, that is, until you speak to someone who has either been forced to use ASR because there are no humans to speak to, or someone who uses it all the time to dictate into their computer or mobile phone. Rather like the Brexiteers and Remainers, people seem to fall firmly in one camp or the other with nothing in between – and this is due primarily to their experience with ASR.
Like most things in life, understanding why things happen doesn’t necessarily stop them happening, but it can make them easier to bear. We all know how much better we feel at the airport when we get told why the plane hasn’t taken off. So if you are one of the lovers or haters, maybe this will help. If you don’t care, you can stop reading now.
Voice Transcription Or Voice Recognition?
But first let me clear up one major misunderstanding. A lot of people use the term voice transcription or voice recognition when they mean speech recognition. I have no idea what voice transcription could possibly mean, but few users of the term really mean recognising someone’s voice. Speech recognition is all about understanding what you are saying. Voice recognition is all about who is saying it. Indeed, the very last thing you want of a good speech recognition system is for its performance to depend on who is speaking.
I spent a good part of my early research life finding out the effect of stress on speech. As it happens, I was then interested in the stress of fighter pilots flying at 50 ft above sea level, but the fighter pilot’s stress has the same effect on speech as the stress of getting stuck in a long queue at Tescos. As humans, we don’t expect the meaning of what women say to be different from what men say just because their voices generally have a higher pitch. We know when we are listening to someone who is stressed, but it is exactly the speech characteristics that give this away that we want to remove.
So once you start using the wrong characteristics of someone’s voice to work out what they are saying you are doomed to failure. Remember that we can easily understand a parrot speaking, and a parrot’s speech has very little in common with a human’s.
How Humans Recognise Speech
OK, so now we have got voice recognition out of the way, let’s talk some more about what the human does when recognising speech. It actually bears some similarity to our vision. Our eyes only actually focus on a very small part of what comes into our retina. We build up our mental picture, not from the bit we instantaneously focus on, but from many images and lots of previous knowledge. Our mental picture can vary drastically from the physical “photographic” image. Exactly the same is true of speech.
I have often presented speakers with computer transcriptions of their own speech that they have insisted was inaccurate. Once I play them back what they said and let them simultaneously follow with the computer transcription, they are often surprised how accurate it is; surprised that they repeated the same phrase several times; surprised that they say “like”, “um” and “arr” so often; surprised that the word that sounded so obvious was impossible to understand when heard in isolation. And what they don’t consider is that the human listener has years of experience learning the structure of language.
Last but by no means least, the human mostly knows what the speaker is talking about. With all this extra knowledge, it is hardly surprising that what we comprehend often differs from what the speaker actually utters. We call all this extra knowledge “context”, and without it, we humans are not that much better than machines.
How Computers Recognise Speech
In the 1980s, when I was doing my PhD research at NPL in automatic speech recognition, I used a computer that had 128 KB of memory and executed 0.75 million (16-bit) instructions per second. Today, even an iPhone 6 typically has 128 GB of storage and executes 25 million (64-bit) instructions per second – over thirty times as fast, with a million times the capacity of the mini-computers of the 80s.
But even with this seemingly vast amount of computing power, the ASR processing for Siri does not happen on the iPhone itself; instead, the speech is sent over the Internet to a much more powerful computer for processing. For my group at NPL to achieve any sort of usable speech recognition in the 1980s, we could not rely on number crunching – we simply did not have the computing power to do it, well not in real time. So we had to find other ways, and one of them was to use our old friend context. If we limited acceptable recognition to phrases that fitted the context of the speaker, then we could make up for the lack of computing power. That still holds good, as the sketch below illustrates.
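To make the idea concrete, here is a minimal sketch in Python of what “limiting acceptable recognition to phrases that fit the context” can look like. The phrase lists and the garbled hypotheses are entirely hypothetical, and this is not the system we built at NPL; the point is simply that snapping the recogniser’s raw guess onto the nearest phrase the context allows can rescue a poor acoustic result.

```python
# A minimal sketch of context-constrained recognition: instead of trusting the
# recogniser's raw best guess, only accept phrases that make sense in the
# current context. The contexts and phrase lists here are hypothetical.
from difflib import SequenceMatcher

CONTEXT_PHRASES = {
    "banking": ["check my balance", "pay a bill", "report a lost card"],
    "lighting": ["turn the lights on", "turn the lights off", "dim the lights"],
}

def constrained_match(raw_hypothesis: str, context: str) -> str:
    """Snap a raw ASR hypothesis onto the closest phrase allowed in this context."""
    candidates = CONTEXT_PHRASES[context]
    # Pick the allowed phrase most similar to what the recogniser thinks it heard.
    return max(candidates,
               key=lambda p: SequenceMatcher(None, raw_hypothesis, p).ratio())

# Even a badly garbled hypothesis lands on something sensible for the context.
print(constrained_match("turn the lice on", "lighting"))   # -> "turn the lights on"
print(constrained_match("pay a bull", "banking"))          # -> "pay a bill"
```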
ASR v Humans
But we humans learn quickly, and if we are sufficiently motivated, we can get ASR to perform much better than any other text-entry method. The average person can speak text several times faster than they can type it. Even allowing for modest recognition rates, it can be quicker to dictate and then correct the ASR output than to type the text by hand. And when the motivation is high, the performance is better. When the motivation is low, so too is the recognition.
Nobody wants to talk to a computer rather than a human, especially when you are paying for a service where you expect a human. Hence we are less likely to modify our speech to make the process quicker. Illogical, I know, but we actually want it to fail – until we discover there is no alternative.
The Value Of Context
However, as I said, if we can somehow get some context into the process, we can often improve on the intrinsic ability of the computer to recognise speech. This context might be the relationship between the words uttered, or it might be knowledge of the subject being articulated. The former may be summarised as “grammar”, the rules we use to construct meaningful sentences, and these rules are much the same for all speakers. For example, it doesn’t make grammatical sense to say “this apprehension”, but “this misapprehension” does. Even if the second syllable of the phrase acoustically matches “this” better than “mis(s)”, the result doesn’t make grammatical sense, so it is better to present the grammatically sensible answer. Likewise, the phrase “attacks on merchant shipping” is phonetically identical to “a tax on merchant shipping”, yet the two mean quite different things. Knowing whether the speaker is an accountant or a naval commander could help get that right.
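Here is a toy illustration of how such a tie might be broken. The acoustic evidence for the two phrases is identical, so a prior about the speaker’s world has to do the work; all the probabilities below are invented purely for the example.

```python
# Rescoring acoustically identical hypotheses with domain-dependent priors.
# The scores are made up for illustration only.
import math

# Two transcriptions the acoustics cannot tell apart.
HYPOTHESES = ["attacks on merchant shipping", "a tax on merchant shipping"]

# Hypothetical log-probabilities of each phrase, by speaker domain.
DOMAIN_PRIOR = {
    "naval":      {"attacks on merchant shipping": math.log(0.08),
                   "a tax on merchant shipping":   math.log(0.001)},
    "accounting": {"attacks on merchant shipping": math.log(0.001),
                   "a tax on merchant shipping":   math.log(0.05)},
}

def rescore(acoustic_scores: dict, domain: str) -> str:
    """Combine acoustic scores with a domain prior and pick the winner."""
    return max(acoustic_scores,
               key=lambda h: acoustic_scores[h] + DOMAIN_PRIOR[domain][h])

acoustic = {h: -42.0 for h in HYPOTHESES}   # identical acoustic evidence for both
print(rescore(acoustic, "naval"))        # -> "attacks on merchant shipping"
print(rescore(acoustic, "accounting"))   # -> "a tax on merchant shipping"
```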
Sadly, despite having enormous number-crunching power at their disposal, even some of the world’s largest development teams seem not to make the best use of context. Instead, they prefer to mimic the human brain’s learning capability with techniques such as neural networks – one of the many misleading guises of Artificial Intelligence (and I doubt parrots are used to train those networks on human speech). That is all well and good, but it seems a cop-out to ignore what we know of linguistics and grammar.
Yet there are other great sources of context, and digital messages such as emails are one of them. As part of our Threads service, we transcribe telephone conversations so they can be accessed in the same way as, and together with, emails. Transcribing telephone speech poses some severe challenges, not the least of which are the poor acoustic quality of telephone calls (compared with speech captured directly by a microphone) and the fact that the speakers are not motivated to articulate their speech well – they are mostly unaware it is being transcribed. But we don’t need to reinvent the wheel or come up with a better ASR algorithm – even if we could. We can (and do) overcome these challenges by extracting context from digital messages and applying it to telephone calls. The context comes not just from the message content, but from who is speaking or writing, who they work for and where.
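As a rough sketch of the idea – and not our actual Threads implementation – context harvested from a caller’s recent emails can be turned into a term list that nudges the transcription towards the words those messages make likely. The emails, the scores and the boost weight below are all made up for illustration.

```python
# A simplified sketch of biasing telephone transcription with context mined
# from emails: hypotheses that mention terms seen in the caller's recent
# messages get a small score boost. Everything here is hypothetical.
import re
from collections import Counter

def salient_terms(emails: list[str], top_n: int = 20) -> set[str]:
    """Crude keyword extraction: the most frequent longer words in recent emails."""
    words = re.findall(r"[a-z]{5,}", " ".join(emails).lower())
    return {w for w, _ in Counter(words).most_common(top_n)}

def rerank(nbest: list[tuple[str, float]], terms: set[str], boost: float = 2.0) -> str:
    """Pick the hypothesis whose recogniser score plus context bonus is highest."""
    def biased(item):
        text, score = item
        bonus = boost * sum(w in terms for w in text.lower().split())
        return score + bonus
    return max(nbest, key=biased)[0]

emails = ["The Penrose contract renewal is due next week.",
          "Can you send the Penrose renewal figures before Friday?"]
nbest = [("the pen rose contract is due", -51.0),    # acoustically slightly better
         ("the penrose contract is due", -51.5)]     # contextually far more likely
print(rerank(nbest, salient_terms(emails)))          # -> "the penrose contract is due"
```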
Conclusion
Few people who have not studied automatic speech recognition in depth can appreciate what a massively complex task the human brain performs when recognising human speech. Despite the rapid advances in technology, it is, in my opinion, unlikely we shall see better automatic speech recognition just through increased computing power and ever more sophisticated neural networks. Instead, we need to understand what the human brain is doing when it recognises speech and learn from that as best we can. We can put up with current performance as long as we just want to switch the lights on and off, but for more sophisticated transactions we must use context. If the banks want artificial intelligence (as Alan Turing described it) to save them money, then ironically, they need to treat us more like parrots. This may take more time.
None of my words will make your bank’s computer any better at recognising your speech, but you may feel better knowing why it often appears so much worse than you expect from a human. But then again, if reading this makes you less stressed, maybe you can improve the performance. If so, then you are a lot better than I.