Speech technologies have taken off over the last decade and changed the way we interact with devices, communicate with companies and live our daily lives.
Loquendo, a provider of speech technologies for creating next-generation speech-enabled applications, has continued to introduce new solutions and languages and to expand into new markets, despite the difficult economic conditions plaguing today's market.
Not only has the company's continued profitability - boasting profits for the fifth consecutive year - established it as a financially sound and reliable provider, but it has also won many industry accolades, including 'Market Leader - Best Speech Engine' at both the 2007 and 2008 Speech Industry Awards, further validating its significance in the speech market.
I took some time recently to chat with Paolo Baggia, Director of International Standards at Loquendo, to find out more about Loquendo and the speech technologies market.
What are some of the most significant advancements in speech today that have allowed the technology to succeed?
After a significant leap forward in the late '90s, speech technologies have achieved many advances, including more natural-sounding speech synthesis thanks to the Unit Selection Concatenation technique. Once applied, this approach gave the market very high-quality synthetic voices.
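As an aside for readers unfamiliar with the technique Baggia mentions: unit selection synthesis stitches together small recorded speech units, choosing the sequence that minimizes a combination of "target cost" (how well each unit fits the desired context) and "join cost" (how smoothly adjacent units concatenate), typically with a dynamic-programming search. The sketch below is illustrative only; the data, cost functions, and function names are invented for the example and are not Loquendo's implementation.

```python
# Toy sketch of unit-selection synthesis: a dynamic-programming (Viterbi-style)
# search over candidate recorded units, minimizing target cost + join cost.

def select_units(candidates, target_cost, join_cost):
    """candidates: one list of candidate units per target position."""
    n = len(candidates)
    # best[i][j] = minimal total cost of a path ending in candidate j at position i
    best = [[target_cost(u, 0) for u in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, n):
        row, ptr = [], []
        for u in candidates[i]:
            # cost of arriving at unit u from each candidate at the previous position
            costs = [best[i - 1][k] + join_cost(prev, u)
                     for k, prev in enumerate(candidates[i - 1])]
            k = min(range(len(costs)), key=costs.__getitem__)
            row.append(costs[k] + target_cost(u, i))
            ptr.append(k)
        best.append(row)
        back.append(ptr)
    # trace back the cheapest path through the candidate lattice
    j = min(range(len(best[-1])), key=best[-1].__getitem__)
    path = [j]
    for i in range(n - 1, 1 - 1, -1):
        j = back[i][j]
        path.append(j)
    path.reverse()
    return [candidates[i][j] for i, j in enumerate(path)]

# Hypothetical example: units are (phone, pitch) pairs; the desired pitch per
# position is 100, 110, 120 Hz, and joins penalize pitch discontinuities.
targets = [100, 110, 120]
cands = [[("a", 98), ("a", 130)], [("b", 111), ("b", 90)], [("c", 119)]]
tc = lambda u, i: abs(u[1] - targets[i])   # target cost: distance from desired pitch
jc = lambda p, u: abs(p[1] - u[1]) * 0.5   # join cost: pitch jump between units
print(select_units(cands, tc, jc))         # -> [('a', 98), ('b', 111), ('c', 119)]
```

Real systems score many more features (spectral match, duration, phonetic context) over databases of hours of recorded speech, but the core search is the same lattice minimization shown here.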
In addition, speech recognition is showing enhanced performance, making increasingly complex recognition tasks possible. Other related technologies, such as Speaker Verification and Speaker Identification, are also making significant advances, and are now ready to complement existing means of authentication in the security and intelligence sector.
Can you talk about the role speech recognition is playing in the current marketplace?
Speech recognition is very effective in self-service and transaction applications, which are now much more common and often found in customer care applications. Another area that is reaching maturity today is speech analytics, i.e. the use of speech recognition to assess the behavior of customers in traditional or automated call centers.
The next challenge is to apply speech recognition to perform unconstrained searches by voice - the new field of Voice Search - along with the indexing of large databases of voice or media recordings made available by media companies.
A further emerging area is the changing role of embedded speech recognition, from simple commands and voice dialing to more complex tasks, such as address entry for personal navigation devices (PNDs) or even Web search by voice.
How important is text to speech synthesis and where/how is it being used?
Despite the increased naturalness of speech synthesis, many applications still rely on pre-recorded prompts. I foresee an increase in the use of speech synthesis in speech applications, as well as TTS in embedded systems and devices to allow hands-free interaction or to improve accessibility. Another interesting area is the advent of virtual, human-like agents (avatars), which speak with synthetic voices and employ sophisticated animation technology synchronized with the TTS to give highly realistic results.
In what ways is speech allowing for Multimodal user interfaces in mobile and Web apps and Call center applications?
Although multimodal user interfaces have long been studied and compared with GUIs and VUIs, there is today a new and broader interest in multimodal interfaces on mobile and small devices. One driver is the need to complement speech with other modalities, for activities such as Web search, list picking, or map display. The visual modality and the voice modality are complementary, and both technologies have now reached maturity. This could be the right time to exploit market demand for multimodal applications.
Are there any challenges to still be overcome with speech technologies?
Yes, undoubtedly there are. Difficult environmental conditions, such as noise and background music, are still challenges for speech recognition today. However, the availability of ever more data is making it possible to find new, better-performing error-reduction algorithms. Larger recognition tasks, such as unconstrained search, still push the limits of today's technologies.
When it comes to speech synthesis, there is still a need to personalize the emotional profile and character of a synthetic voice without losing the naturalness that has been achieved. Another challenging area is the creation of expressive prompts which reflect the meaning of the text, and this is an active research area.
Can you talk a little about the Voice Search event you recently took part in and what trends you specifically noticed following the show?
Despite the difficult economic times, Voice Search 2009 in San Diego was a very successful conference, where experts in speech technologies compared the most recent advances in speech products and developments in the market. The move toward voice search and the need for multimodal interfaces were both key takeaways from the conference.