Transcribing phone calls – why that is the easy bit!


When computer scientists began researching automatic speech recognition (ASR) in the 1980s, the aim of transcribing telephone calls was well down their list of priorities. For one thing, the audio quality of phone calls was, by today’s standards, awful; for another, general-purpose computers typically had a thousandth of the processing power of even a modern mobile phone.

It was difficult enough trying to get machines to recognise human speech captured with a high-quality microphone. With a cheap microphone, over a noisy network of deliberately limited bandwidth, the task seemed unthinkable. The assumption that if a human can understand the speech then a machine should be able to as well ignored the mass of contextual information – the background to the conversation, knowledge of the speaker, grammar and so on – that a human draws on when recognising speech.

However, things have moved on. Not only is the fidelity of calls now much higher and the capacity of computers much greater, but so-called “artificial intelligence” algorithms have also advanced enormously.

High-quality, multi-speaker, continuous speech recognition is now routinely available from more than a dozen internet services. High-speed internet connections allow a client to digitise speech, send it to powerful remote computers dedicated to speech recognition, and receive the transcription back almost as quickly as the words are spoken. And those computers run algorithms representing hundreds of person-years of work.
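
To make that round trip concrete, here is a minimal client-side sketch in Python, assuming a generic hosted ASR service with an HTTP upload endpoint. The URL, API key, request fields and response field are placeholders, not any real provider’s API.

```python
import requests

# Hypothetical hosted ASR service – endpoint, key and field names are placeholders.
ASR_ENDPOINT = "https://asr.example.com/v1/transcribe"
API_KEY = "your-api-key"

def transcribe(audio_path: str) -> str:
    """Upload a digitised call recording and return the transcript text."""
    with open(audio_path, "rb") as audio:
        response = requests.post(
            ASR_ENDPOINT,
            headers={"Authorization": f"Bearer {API_KEY}"},
            files={"audio": audio},        # the recorded call, e.g. a WAV file
            data={"language": "en-GB"},    # assumed request parameter
            timeout=300,
        )
    response.raise_for_status()
    return response.json()["transcript"]   # assumed shape of the JSON reply

if __name__ == "__main__":
    print(transcribe("call_recording.wav"))
```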

With such a rich choice of speech recognition services, no company that wants to use human speech as an input or control mechanism would dream of writing its own code. What is the point, when such a facility costs as little as £3 per hour? Try getting a human to transcribe an hour of speech for that!

So, the recognition of speech has become the easy bit. What is far more difficult – at least so far as telephone calls are concerned – is actually obtaining the speech to be recognised. 

If you told the average business user that to transcribe a call all they needed to do was obtain a digital file containing the call, upload it to the ASR service and then download the transcription, you would immediately shrink the available market enormously. Leaving aside setting up the ASR account and paying for the transcription, the knowledge needed to extract a digitised call from a commercial telephone system is beyond the capability – and the available time – of most people. And once you have transcribed the call, what do you do with it? It’s no good leaving it in a file on your desktop. How will you ever find it again?
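
Even the “simple” workflow above has several moving parts. The sketch below shows just the filing step, building on the hypothetical transcribe() helper from the earlier sketch; the sidecar-file layout is only one assumption about how a transcript might be kept findable, not a recommendation.

```python
import json
from datetime import datetime
from pathlib import Path

def transcribe_and_file(recording: Path) -> Path:
    """Transcribe one call recording and keep the result next to it with some metadata."""
    transcript = transcribe(str(recording))    # hypothetical helper from the earlier sketch
    sidecar = recording.with_suffix(".json")
    sidecar.write_text(json.dumps({
        "recording": recording.name,
        "transcribed_at": datetime.now().isoformat(),
        "transcript": transcript,
    }, indent=2))
    return sidecar   # better than an orphan text file, but still hard to search across calls
```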

But even these problems pale into insignificance in a commercial telephone system, where a call may be taken by one employee and then passed on to one or more others. How do you track who was involved, and which people each call segment relates to?
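
One way to picture the problem is as a data-modelling exercise: a call is not a single recording but a chain of segments, each handled by a different employee. The sketch below is a hypothetical model to illustrate the point, not how any particular phone system actually represents calls.

```python
from dataclasses import dataclass, field
from datetime import datetime

@dataclass
class CallSegment:
    employee: str          # who handled this part of the call
    started: datetime
    ended: datetime
    transcript: str = ""   # filled in once the ASR result arrives

@dataclass
class Call:
    caller: str
    segments: list[CallSegment] = field(default_factory=list)

    def participants(self) -> set[str]:
        """Every employee who handled the call at some point."""
        return {segment.employee for segment in self.segments}
```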

These are the issues that Threads was designed to address. Threads collects calls transparently and despatches them to the best ASR service available at the time. When the result is received, it is loaded into a database that any authorised employee can easily search. What is more, the call gains new relevance when seen in the context of emails.
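
This is not Threads’ actual implementation, but to illustrate the final step, here is a minimal sketch of a searchable transcript store using SQLite’s FTS5 full-text index (assuming your SQLite build includes FTS5). The table layout and field names are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect("calls.db")
conn.execute(
    "CREATE VIRTUAL TABLE IF NOT EXISTS transcripts "
    "USING fts5(call_id, caller, taken_by, transcript)"
)

def index_call(call_id: str, caller: str, taken_by: str, transcript: str) -> None:
    """Store one transcribed call so it can be searched later."""
    conn.execute(
        "INSERT INTO transcripts VALUES (?, ?, ?, ?)",
        (call_id, caller, taken_by, transcript),
    )
    conn.commit()

def search(term: str) -> list:
    """Full-text search across every transcript, e.g. search('invoice')."""
    return conn.execute(
        "SELECT call_id, caller, taken_by FROM transcripts WHERE transcripts MATCH ?",
        (term,),
    ).fetchall()
```

The point of the sketch is simply that once a transcript sits in a database rather than a desktop file, searching calls becomes as routine as searching email.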

We take for granted our ability to search our emails, yet the ability to search and locate our phone calls is something we never expect. Not unless we use Threads.