
The rise of voice activation and what it means for embedded user interfaces

Written by Jason Clarke | May 20, 2020 3:42:26 PM


Voice recognition technology continues to grow in both capability and popularity. The original technology was rather primitive; you may remember struggling with a business's automated attendant just to perform the simplest tasks. Thanks to enhancements in speech recognition and cloud-driven digital assistants from big players like Amazon, Apple, Google, and Microsoft, the state of the art has come a long way. The word error rate of a modern voice recognition engine is around five percent, very nearly matching the roughly four percent that humans achieve. That is a huge stride: by comparison, the lowest error rate for unconstrained speech 20 years ago was 43 percent. Highly stilted grammars are also a thing of the past; instead, we communicate with our devices using natural, flowing speech.

With fewer errors and improved capability, voice assistants have found increasingly receptive consumers. It's estimated that by 2024, consumers will be able to interact with over 8.4 billion voice-assistant-enabled devices, roughly double the 4.2 billion devices expected to be in use by the end of 2020.

Will the touchscreen embedded UI disappear in favor of voice-activated devices?

Machine learning has given our devices a brain: the power of today's voice recognition comes from AI models trained on massive data sets, able to handle multiple commands at once and to give us richer feedback.

Some think that the prevalence of voice recognition moves us toward a world of Zero UI, where interactions no longer take place on a screen and our devices interact with us using only natural speech. However, this enhanced performance and utility can come at a cost.

Many others, including us at Crank Software, believe that voice recognition strengthens the user experience but in most cases can't replace the traditional visual and tactile interactions we have come to rely upon.

Putting aside for a moment the thought of Silicon Valley giants capturing, categorizing, and divulging our every utterance to unwelcome advertisers, sometimes voice-only services simply can't address the task at hand.

Even Al Lindsay, Amazon's Alexa Engine VP, does not think that voice recognition will replace the graphical user interface, since some experiences can only really be managed graphically. For example, if a user wants to compare ten different items, working with visuals rather than voice is going to be a far better experience. Likewise, navigating with a graphical map is intuitive, whereas navigating by voice alone can be downright painful. But it doesn't have to be that way.

A better world of blended modalities

Combining traditional graphical user interfaces (GUIs) with new (and evolving) modes of interaction can help create and drive great user experiences on devices. Voice interaction integrated into an embedded GUI can make tasks more convenient, such as setting the oven to a certain temperature, while a visual indication of the verbal command on the device's UI reassures users that everything was understood correctly. The old adage that a picture is worth a thousand words rings true for a device's GUI.
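To make the blended approach concrete, here is a minimal sketch in C of how a recognized voice command might drive both the appliance and an on-screen confirmation. Every type and function name below is hypothetical; this is not Storyboard's API, just an illustration of the pattern.

```c
/* A minimal sketch (hypothetical names only) of routing one recognized
 * voice command to both the device and its embedded GUI. */
#include <stdio.h>
#include <string.h>

/* Parsed result we assume comes back from a voice recognition front end. */
typedef struct {
    const char *intent;   /* e.g. "set_temperature" */
    int         value;    /* e.g. 350               */
} voice_intent_t;

/* Hypothetical hooks into the appliance and its embedded GUI. */
static void oven_set_temperature(int degrees) {
    printf("[device] heating to %d degrees\n", degrees);
}

static void gui_show_confirmation(int degrees) {
    /* The on-screen confirmation is what reassures the user that the
     * spoken command was understood correctly. */
    printf("[gui] \"Oven set to %d degrees\"\n", degrees);
}

/* Route a single intent: act on it, then reflect it visually. */
static void handle_voice_intent(const voice_intent_t *vi) {
    if (vi->intent != NULL && strcmp(vi->intent, "set_temperature") == 0) {
        oven_set_temperature(vi->value);
        gui_show_confirmation(vi->value);
    } else {
        printf("[gui] \"Sorry, I didn't catch that\"\n");
    }
}

int main(void) {
    voice_intent_t spoken = { "set_temperature", 350 };
    handle_voice_intent(&spoken);
    return 0;
}
```

In a shipped product the same pattern applies: the recognizer produces an intent, the application acts on it, and the GUI closes the loop visually.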

Recently, Siana Systems, a company dedicated to extending the user experience beyond traditional embedded technologies, used a UI built with the Crank Storyboard UI development tool to integrate voice services into a demo thermostat. Their voice recognition software leveraged Alexa Voice Service (AVS), combining an audio front end with hotword detection that keeps working even if the Internet connection goes down; because no data is shared with external servers, it is inherently private. Best of all, for task-oriented commands it is just as accurate at natural language understanding as the big-brand digital assistants, and sometimes significantly more so. As public concern around privacy increases and regulations like GDPR and CCPA become commonplace, we expect fully embedded solutions like Snips to explode in popularity with device makers. If you want to see what this looks like, click here.
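As a rough illustration of why an embedded approach can stay private and keep working offline, here is a sketch of such a pipeline. All names and functions below are hypothetical stubs written for this post; they are not the AVS, Snips, or Storyboard APIs, and the hotword and grammar logic is deliberately trivial.

```c
/* A minimal sketch of an offline-friendly, privacy-minded voice pipeline:
 * the wake word is detected on the device and a small embedded grammar
 * handles task-oriented commands, so nothing needs to leave the unit.
 * Every function here is a hypothetical stub for illustration only. */
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

typedef struct {
    const char *transcript;   /* stand-in for a captured audio frame */
} audio_frame_t;

/* Stub wake-word detector: runs locally, so the device stays silent to
 * the outside world until it is actually addressed. */
static bool hotword_detected(const audio_frame_t *f) {
    return strstr(f->transcript, "hey thermostat") != NULL;
}

/* Stub embedded recognizer: a tiny on-device grammar that keeps working
 * when the Internet connection goes down. */
static bool local_recognize(const audio_frame_t *f, char *intent, size_t len) {
    if (strstr(f->transcript, "set") != NULL && strstr(f->transcript, "72") != NULL) {
        snprintf(intent, len, "set_temperature:72");
        return true;
    }
    return false;
}

/* Stub GUI hook: the screen confirms what was heard. */
static void ui_apply_intent(const char *intent) {
    printf("[gui] confirming intent '%s'\n", intent);
}

int main(void) {
    audio_frame_t frame = { "hey thermostat, set it to 72 degrees" };
    char intent[64];

    if (hotword_detected(&frame) && local_recognize(&frame, intent, sizeof intent)) {
        ui_apply_intent(intent);
    }
    return 0;
}
```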

What does the future hold for embedded UIs?

Just because a new way of interacting with devices comes along does not mean we have to disregard everything we have successfully applied to previous interfaces; we simply need to adapt our process for designing and developing products. This rings even more true when you consider that, as humans, we each have a preferred method for interacting with devices (and with other humans), including different levels of tolerance for those interactions. Just think about the last time you used your voice to move through an IVR. How quickly were you hitting the button to speak to a live person?

Voice communication is crucial, and it has enabled an immediacy and flexibility that we will continue to see explode across different IoT applications (even vodka bottles!). However, most human interaction is nonverbal, and humans process 83 percent of communication by sight. Even when we are having a face-to-face conversation, much of what is communicated comes through visual cues. We believe that while voice recognition and GUIs continue to improve, their best uses will be together, in a transformative, immersive user experience. The future of UI is bright, and it's multimodal.