Given the advances in automatic speech recognition (ASR), you may wonder why there is so little evidence of it being used to transcribe phone calls. The answer is simple: it is rarely used. But the reason for this is only partially technical.
Why is attempting ASR on telephone-quality speech more challenging than dictating into a smartphone?
In order to move a call through the telephone network, a significant amount of the audio data is thrown away – the signal is band-limited to roughly 300–3400 Hz, sampled at 8 kHz and heavily compressed. Obviously this is not enough to stop a human understanding it at the other end, but it is enough to make life more difficult for the computer.
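To get a feel for how much is lost, here is a minimal sketch that simulates telephone-quality audio from an ordinary recording. It assumes a mono 16 kHz WAV file (the file names are placeholders) and applies an 8 kHz downsample plus G.711-style μ-law companding:

```python
# Minimal sketch: simulate telephone-quality audio from a mono 16 kHz WAV file.
# File names are placeholders; assumes the input is mono, 16-bit PCM.
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

rate, audio = wavfile.read("dictation.wav")            # e.g. 16000 Hz, int16
audio = audio.astype(np.float32) / 32768.0             # normalise to [-1, 1]

# Downsample to the 8 kHz narrowband rate used by the telephone network.
narrowband = resample_poly(audio, up=8000, down=rate)

# G.711-style mu-law companding: compress, quantise to 8 bits, then expand.
mu = 255.0
companded = np.sign(narrowband) * np.log1p(mu * np.abs(narrowband)) / np.log1p(mu)
quantised = np.round(companded * 127) / 127
restored = np.sign(quantised) * np.expm1(np.abs(quantised) * np.log1p(mu)) / mu

wavfile.write("telephone_quality.wav", 8000, (restored * 32767).astype(np.int16))
```

Play the two files back to back: a human barely notices the difference, but for an ASR system the gap is substantial.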
A human can tolerate significant degradation in sound quality without speech becoming unintelligible, yet the computer cannot. This is simply because the human is applying so much more knowledge to interpreting the speech – knowledge of the speaker, knowledge and understanding of the subject matter, knowledge of the context. Advanced as computer algorithms are, they have access to nowhere near the information that a human does.
But even if ASR were technically perfect, other barriers that prevent it being routinely used to transcribe phone calls would still remain.
Let us take an established ASR service such as Google Speech.
If you want to use it to transcribe a phone call, the first thing you need to do is digitally record the call. Then you need a Google account. Then you have to upload the recording – oh, and be sure it has not been compressed first. Then, when you get the transcription back, you have to note who was involved in the call. And if you ever want to find it again, you have to store it somewhere with the date and time of the call and the names of the parties involved.
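To make that concrete, here is a rough sketch of the upload-and-transcribe step using the Google Cloud Speech-to-Text Python client. The bucket path, language code and audio settings are assumptions for illustration, and you would also need Google Cloud credentials already configured:

```python
# Rough sketch of the transcription step via Google Cloud Speech-to-Text.
# The gs:// path and language settings below are illustrative assumptions.
from google.cloud import speech

client = speech.SpeechClient()  # requires a Google account and credentials

config = speech.RecognitionConfig(
    encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,  # uncompressed audio
    sample_rate_hertz=8000,           # telephone-quality narrowband recording
    language_code="en-GB",
    enable_automatic_punctuation=True,
)
audio = speech.RecognitionAudio(uri="gs://my-bucket/call-recording.wav")  # hypothetical path

# Long-running recognition suits recordings longer than a minute, such as calls.
operation = client.long_running_recognize(config=config, audio=audio)
response = operation.result(timeout=300)

transcript = " ".join(r.alternatives[0].transcript for r in response.results)
print(transcript)
```

And notice what you get back: raw text, with no record of who was speaking, when, or why.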
So even if you have the infrastructure all set up, you would probably spend more time storing the call than you did participating in it.
Phone call transcription is about much more than converting some speech into text.
Of course, there are some companies that do have the resources to set up such an infrastructure, but they do so largely for damage limitation reasons – typically, call centres where they need to be able to recover a call if something goes drastically wrong.
But, you know, that is one of the weakest use cases you can find for call transcription.
You should see telephone calls just as you see your emails. You would consider an email system where you had to know the exact date and time of sending to find a specific email utterly useless. Yet that is exactly the situation with phone calls, and we accept it.
In many cases, we write emails when a phone call would be much easier and quicker. We write the email because we want a written record that we can later search. We should be able to do that with phone calls.
So what can be done?
About 10 years ago, we decided we should store email in our CRM. Then about 5 years ago, we realised the significance of phone calls and decided to include those too in our CRM. It was a revelation. Even without transcription, seeing a phone call in the context of some email exchanges can completely change the perspective on a transaction. Having seen what a game changer this was, we proceeded to apply ASR so we could search the calls, just as we could the emails. It was another revelation.
But was this rocket science?
Well, it wasn’t easy, but we didn’t have to develop ASR, nor an email system, nor optical character recognition, nor any of the other amazing services available in the Cloud. What we did was glue them all together, so that users didn’t need to set up a Google account, nor download an MP3, nor note who they were speaking to and when. They just picked up the same phone or used the same email app, exactly as they always had. And we ran all of this software in the Cloud, so it could be accessed from anywhere.
We called it Threads.
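To give a flavour of what that glue looks like, here is a purely illustrative sketch – the function names and data model are hypothetical, not the actual Threads implementation. The idea is that when a call ends, it is transcribed and dropped into the same searchable index as the emails, with the parties and timestamp attached automatically:

```python
# Purely illustrative glue code: hypothetical names, not the Threads implementation.
from dataclasses import dataclass
from datetime import datetime

@dataclass
class CallRecord:
    caller: str
    callee: str
    started_at: datetime
    recording_uri: str        # where the telephony platform stored the audio
    transcript: str = ""

def transcribe(recording_uri: str) -> str:
    """Stand-in for a cloud ASR call (e.g. the Speech-to-Text sketch above)."""
    return f"(transcript of {recording_uri})"

def index_document(doc_type: str, parties: list, timestamp: datetime, body: str) -> None:
    """Stand-in for the search index shared with email and other documents."""
    print(f"indexed {doc_type} between {parties} at {timestamp:%Y-%m-%d %H:%M}")

def on_call_finished(call: CallRecord) -> None:
    """Triggered automatically when a call ends - the user does nothing extra."""
    call.transcript = transcribe(call.recording_uri)
    index_document("call", [call.caller, call.callee], call.started_at, call.transcript)
```

The point is not the code itself but where it runs: in the Cloud, out of sight, so the user never has to touch any of the underlying services.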
And what we discovered was that by combining all this information we were able to make ASR better on phone calls than it had ever been, because we were beginning to do what the human does: understand the communications.