Speech recognition is important for VR: it helps simulate conversations with AI agents and lets users interact with apps that offer a large number of options. Yes, you can type commands, but that is impractical in a headset, and overcrowding an app with buttons or GUI elements makes it confusing. Any user who is capable of speech, on the other hand, can simply talk while in VR.
"Here’s the core idea: thinking out loud is often less arduous than writing. And it’s now easier than ever to combine the two, thanks to recent advances in speech recognition technology." - Descript
Speech Recognition Personal Assistant (U)
Unity Labs’ virtual reality (VR) project features a personal assistant called U, which lets the user perform certain actions simply by speaking. The team has been researching speech recognition and analysis tools that could be used to implement various voice commands. U uses semantic analysis technology built around a continuously learning AI system: it uses neural networks to connect words and phrases and determine the user’s intent, picks up on a user’s speech patterns to tailor its responses, and anticipates what the user is likely to say in specific scenarios.
Unity Labs’ initial research on speech recognition has also involved evaluating existing speech-to-text solutions. They have developed a package for the Asset Store that integrates several of these solutions as Unity C# scripts. The package comes with a sample scene that compares the text transcriptions from each API side by side. It also lets users select a sample phrase from a given list, speak that phrase, and see a quantitative accuracy score for each result.
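The article doesn’t say which metric the sample scene uses for its quantitative comparison; a standard choice for scoring a transcription against a reference phrase is word error rate (WER), i.e. word-level edit distance divided by reference length. A minimal sketch in Python (the package itself is Unity C#, but the metric is language-agnostic):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

# A perfect transcription scores 0.0; each word error raises the rate.
print(word_error_rate("open the red door", "open the red door"))  # 0.0
print(word_error_rate("open the red door", "open a red door"))    # 0.25
```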
The speech-to-text package interfaces with Google Cloud Speech, IBM Watson, Windows dictation recognition, and Wit.ai. All of them hold up well against background speech, but some, such as Windows dictation, insert stray words at the beginning and end of the recording, most likely picking up background speech there that is not masked by foreground speech.
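Wrapping several providers behind one interface is what makes a side-by-side comparison scene straightforward. The names below are hypothetical illustrations, not the package’s actual API; this Python sketch mirrors the idea with an abstract base class and a mock backend standing in for a real service:

```python
from abc import ABC, abstractmethod

class SpeechToTextService(ABC):
    """Common wrapper so each provider can be swapped behind one interface."""

    @abstractmethod
    def transcribe(self, audio: bytes) -> str:
        """Return the text transcription of a recorded audio clip."""

class MockService(SpeechToTextService):
    # Stands in for a real backend (Google Cloud Speech, Watson, etc.).
    def __init__(self, canned_result: str):
        self.canned_result = canned_result

    def transcribe(self, audio: bytes) -> str:
        return self.canned_result

def compare_side_by_side(services: dict, audio: bytes) -> dict:
    # The sample scene displays each API's transcription next to the others.
    return {name: svc.transcribe(audio) for name, svc in services.items()}

results = compare_side_by_side(
    {"google": MockService("open the door"),
     "watson": MockService("open a door")},
    b"\x00\x01")
print(results)  # {'google': 'open the door', 'watson': 'open a door'}
```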
This research was motivated by Carte Blanche’s plan to have the AI agent U respond to voice commands, which requires speech-to-text transcription and keyword recognition. The challenge is that it is very difficult to create an agent with whom the user can have a real conversation. We speak in full sentences and throw in “um”s and “ah”s and words that reflect our feelings. If an AI agent in a VR app can understand not just keywords but every part of our conversational speech, it will bring a whole new level of immersion to the VR environment.
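The keyword-recognition step mentioned above can be illustrated with a crude sketch: strip filler words from a transcription, then look for a known command keyword. The filler and command sets here are hypothetical examples, not Carte Blanche’s actual vocabulary:

```python
FILLERS = {"um", "uh", "ah"}                      # assumed filler words
COMMANDS = {"open", "close", "select", "delete"}  # hypothetical command set

def extract_command(transcript: str):
    """Drop filler words, then return the first recognized command keyword."""
    words = [w.strip(",.!?").lower() for w in transcript.split()]
    meaningful = [w for w in words if w not in FILLERS]
    for word in meaningful:
        if word in COMMANDS:
            return word
    return None  # no command recognized in this utterance

print(extract_command("um, open the, uh, red door"))  # open
```

A real conversational agent would go far beyond this kind of keyword spotting, which is exactly the gap the paragraph above describes.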