Achieving a truly conversational and intelligent voice interface is the holy grail of the IVR industry. It holds the promise of enabling customers to carry on a productive conversation and accomplish things through self-service as easily as talking with an agent. However, as most people have come to realize, the majority of IVR systems today are a long way from providing a quality, agent-like experience. This is readily apparent even in the most advanced IVR systems, which greet callers with an open-ended prompt such as "How can I help you?" only to follow the caller's answer with "I'm sorry, I did not understand." The reason this occurs is that although speech technology is evolving, it does not yet have understanding capabilities that match those of an average customer service representative.
Automated speech recognition and natural language understanding technologies have been around for decades. Speech recognition is the raw conversion of spoken language to text without any interpretation of meaning. You can think of it as pure word transcription. You say something like "I need to go to the store after work to pick up some bread," and automated speech recognition converts the words to text without any idea of what the words mean. That's where natural language understanding comes in. Natural language understanding's job is to interpret what the words mean in the proper context. To accomplish natural understanding, a system must first accurately recognize the words and then derive the meaning. This requires not only recognition of the individual words, but also an understanding of what they mean strung together in a particular way.
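The two stages described above can be sketched in a few lines of Python. The transcription step is simulated with a canned transcript standing in for a real recognition engine, and the "understanding" step is a deliberately naive keyword rule; the function names and intent labels are invented for illustration, not any real product's API.

```python
def speech_to_text(audio):
    """Stage 1: raw transcription -- words only, no interpretation.
    A real system would send the audio to a speech recognition engine;
    here we simply return a canned transcript for illustration."""
    return "i need to go to the store after work to pick up some bread"

def understand(transcript):
    """Stage 2: natural language understanding -- map words to meaning.
    A toy keyword rule stands in for a real language-understanding model."""
    text = transcript.lower()
    if "store" in text and "pick up" in text:
        return "errand: shopping trip"
    return "unknown"

transcript = speech_to_text(audio=None)
print(understand(transcript))  # errand: shopping trip
```

The point of the split is that a perfect transcript is necessary but not sufficient: stage 1 can return every word correctly and the system still fails if stage 2 cannot map those words to an actionable meaning.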
In a customer care environment, advanced natural language understanding capabilities are critical in determining the needs of the caller for proper routing or self-service. If a system can't figure out what you need, how can it even begin to help you? Consider the following statement: "I just cracked the display on my phone and need to replace it." Nearly 100 percent of call center representatives would be able to understand that statement. That's because we are wired to interpret the sounds of language without much conscious thought, not to mention the fact that we received extensive language training starting when we were infants. Getting a computer to understand a sentence such as this, however, is an entirely different matter.
First of all, the computer has to do an adequate job of recognizing the words in the sentence. A recognition error on any one of the words could change the meaning. If the system does happen to recognize the words correctly, it then needs to interpret their meaning within the proper context. To accomplish this task, a designer has to associate a meaning with a phrase such as this. In this case, it might be something like "need replacement display" or "display problem." The designer not only needs to match all the phrases that might mean "display problem," but also all the other relevant categories, such as shipping problems, new sales and so on.
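A minimal sketch of that design task might look like the following. The intent names and phrase lists are invented examples of the many-surface-forms-to-one-meaning mapping the paragraph describes; a production system would use a trained statistical model rather than hand-written substring rules.

```python
# Illustrative mapping from surface phrases to a single meaning.
# Both the intent labels and the phrases are assumptions for this sketch.
INTENTS = {
    "display problem": ["cracked the display", "broken screen", "screen is shattered"],
    "shipping problem": ["package never arrived", "where is my order"],
    "new sales": ["buy a new phone", "upgrade my plan"],
}

def classify(utterance):
    """Return the first intent whose phrase list matches the utterance."""
    text = utterance.lower()
    for intent, phrases in INTENTS.items():
        if any(phrase in text for phrase in phrases):
            return intent
    return "unknown"

print(classify("I just cracked the display on my phone and need to replace it"))
# -> display problem
```

Even this toy version hints at the scaling problem: every new way a caller might phrase "display problem" is another entry the designer must anticipate.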
Training natural language speech systems is done by listening to, transcribing and categorizing thousands of utterances. Designers then assign meaning to the most probable three-to-five-word combinations, creating what is referred to in the industry as a statistical language model, or SLM. The process of building and tuning these SLMs is both time consuming and expensive, and in the end, they still cannot come close to matching the understanding of your average person. There is an inherent challenge in building natural, conversational systems because they require very large language models, and as the size of the models increases, overall accuracy goes down. This occurs because there is a much higher probability that the system confuses two similar-sounding words or phrases.
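The counting step behind an SLM can be illustrated with a few lines of standard-library Python. Real systems are trained on thousands of transcriptions; the three utterances below are invented stand-ins, and the sketch only shows the word-combination counting, not the full model-building and tuning process.

```python
from collections import Counter

# Invented sample transcriptions standing in for a real corpus.
utterances = [
    "i cracked the display on my phone",
    "the display on my phone is broken",
    "i need a new display for my phone",
]

def trigrams(text):
    """Yield every consecutive three-word combination in the utterance."""
    words = text.split()
    return [tuple(words[i:i + 3]) for i in range(len(words) - 2)]

# Count how often each three-word combination occurs across the corpus.
counts = Counter(t for u in utterances for t in trigrams(u))
print(counts.most_common(3))
```

With thousands of utterances, the high-count combinations are the ones designers attach meanings to; the paragraph's accuracy caveat shows up here too, since a bigger vocabulary means more similar-sounding combinations competing with one another.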
Traditional speech recognition and natural language technologies have continued to evolve, but they still have a long way to go before they can carry on a conversation that matches that of a high-quality agent. Fortunately, however, there are human-assisted IVR solutions that bridge the gap and enable truly conversational interactions today. Advanced systems such as these have the ability to send a speech recognition request to either an automated speech recognition engine or an actual person.
Human-assisted speech technology works by applying rules that determine when utterances are routed to an automated speech recognition engine for processing and when they are sent to a specially equipped analyst. One of the primary rules depends on the prompt provided to the caller. A prompt such as "What is your phone number?" would likely be routed first to the automated speech recognition engine. Conversely, a prompt such as "What is your email address?" would be routed to an analyst for recognition. Additionally, any time the automated speech recognition is unable to interpret the response, the utterance can be re-routed to the analyst without re-prompting the caller. This greatly reduces prompts that often frustrate callers, such as "I think you said..." or "I'm sorry, I did not understand. Can you please repeat?"
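The routing rules described above can be sketched as a simple decision function. The prompt categories, confidence threshold and return labels are all assumptions made for illustration; actual vendors' rules and thresholds would differ.

```python
# Prompt types assumed suitable for automated recognition on the first pass.
ASR_FIRST_PROMPTS = {"phone_number", "account_number", "zip_code"}
CONFIDENCE_THRESHOLD = 0.80  # assumed cutoff, not a real product setting

def route(prompt_type, asr_confidence=None):
    """Decide where an utterance goes, per the rules in the article:
    structured prompts try the ASR engine first; free-form prompts go
    straight to an analyst; low-confidence ASR results are silently
    re-routed to an analyst without re-prompting the caller."""
    if prompt_type not in ASR_FIRST_PROMPTS:
        return "analyst"                 # e.g. an email address
    if asr_confidence is None:
        return "asr_engine"              # first pass
    if asr_confidence < CONFIDENCE_THRESHOLD:
        return "analyst"                 # silent re-route, no re-prompt
    return "accept_asr_result"

print(route("email_address"))        # analyst
print(route("phone_number"))         # asr_engine
print(route("phone_number", 0.55))   # analyst
```

The key design point is the third branch: because the fallback is another recognizer (a human) rather than a re-prompt, the caller never hears "I'm sorry, I did not understand."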
The human-assisted approach improves the capabilities and overall experience. Most importantly, it enables callers to accomplish their tasks as easily as working directly with an agent. It also provides the business with the benefits of increased self-service.
As more companies adopt this approach, the IVR industry will win back many of the callers who have become frustrated and given up on IVR systems. Finally, when customers are greeted with "How may I help you?" the systems will have the understanding to actually help.
Phil Gray is executive vice president of Interactions, a vendor of customer interaction services to businesses.
Edited by Rich Steeves