I love doing fun coding projects.
I was listening to the Syntax podcast yesterday, and they did a show about “Nifty Browser APIs”. That is, things you can get code to do in a browser, like locate where you are and play sounds and… do speech recognition…?
“That sounds fun”, went my train of thought, “I’d love to tinker with that and see how it can be used.”
“Oh… wait… I could hook up speech input to Turbo Admin – my command palette plugin and browser extension for WordPress. Then you could control WordPress (to some extent) with your voice!”
Cue visions of the Minority Report computer:
Or “Alexa for WordPress”
So I set to work and, about 90 minutes of hacking later, I had a working prototype…
This is pretty amazing. I mean, I know it doesn’t really SEEM all that amazing. It’s a bit clunky in this first attempt.
But if we compare what we have here to where we started – the standard WordPress admin dashboard – a lot of cool stuff has happened, and it’s the combining of all these things together that has made this voice navigation possible:
- Turning the WordPress menus into a command palette
- Adding “fuzzy” search to the palette to allow direct access to commands and content
- Adding speech recognition on top of that, so you can “type” with your voice.
If you think about it, each of these things is no mean feat in itself, although the Web Speech API does a LOT of the heavy lifting for me.
And it was really fun combining these existing technologies to bring something new to life.
With a bit more work I got here, which is a bit more “hands-free”…
Limitations
This technology is not without limitations. Primarily, it only works in some browsers: there is no Firefox support, Chrome works (this is what I tested in), and Safari works in theory but I’ve not tested it.
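Given the patchy support, it makes sense to check for the API before using it. Here is a minimal sketch, assuming the browser exposes the constructor either as `SpeechRecognition` or under the prefixed name `webkitSpeechRecognition` (which is what Chrome uses). The helper name `getSpeechRecognition` is my own invention for illustration.

```javascript
// Returns the speech recognition constructor found on the given global
// object, or null when the browser does not offer one.
// `getSpeechRecognition` is a hypothetical helper name.
function getSpeechRecognition(globalObject) {
  return (
    globalObject.SpeechRecognition ||
    globalObject.webkitSpeechRecognition ||
    null
  );
}

// In a browser you would pass `window`:
//   const Recognition = getSpeechRecognition(window);
//   if (!Recognition) { /* keyboard input only */ }
```

If this returns `null`, the palette can simply carry on as keyboard-only.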
It should also be noted that Chrome does its speech recognition in the cloud, so there are privacy concerns over that:
Note: On some browsers, like Chrome, using Speech Recognition on a web page involves a server-based recognition engine. Your audio is sent to a web service for recognition processing, so it won’t work offline.
Speech recognition API docs (mozilla.org)
It inherits all of the limitations of Turbo Admin. It just lets you do navigation. It’s not intelligent in any way.
And the prototype is English only. Though I believe you could pass your browser’s language setting to the speech recognition easily enough.
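As a rough idea of what that could look like: the recognition object has a `lang` property, and the browser reports its language as `navigator.language`. This is an untested sketch, and the helper name and the `"en-US"` fallback are my own assumptions.

```javascript
// Copies the browser's language onto a recognition object.
// `applyBrowserLanguage` and the "en-US" fallback are assumptions
// made for this sketch.
function applyBrowserLanguage(recognition, nav, fallback = "en-US") {
  recognition.lang = nav.language || fallback;
  return recognition;
}

// In a browser:
//   applyBrowserLanguage(recognition, navigator);
```

Whether the recognition engine actually supports a given language is a separate question that this does not answer.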
Implementation and challenges
In theory there isn’t really anything very special going on here. The basic steps are:
- Initialize the speech recognition API.
- When the command palette is shown, start recognition.
- The speech recognition API sends events when text has been generated from speech. It basically says “I recognized some text. Here it is!”
- Listen for the events, and when they are triggered, grab the text, shove it into the input field, and trigger a palette update.
- When the command palette is closed, stop recognition.
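The steps above can be sketched roughly like this. It is not the actual Turbo Admin code: `createVoiceInput`, `input` and `onUpdate` are hypothetical names, with `input` standing in for the palette's text field and `onUpdate` for whatever refreshes the palette results.

```javascript
// Wires a speech recognition instance to a text input.
// `Recognition` is the constructor (SpeechRecognition or the prefixed
// webkitSpeechRecognition); the other names are made up for this sketch.
function createVoiceInput(Recognition, input, onUpdate) {
  const recognition = new Recognition();
  recognition.continuous = true; // keep listening between phrases
  recognition.interimResults = true; // deliver text while still speaking

  // "I recognized some text. Here it is!"
  recognition.onresult = (event) => {
    const latest = event.results[event.results.length - 1];
    input.value = latest[0].transcript.trim();
    onUpdate(input.value);
  };

  return {
    start: () => recognition.start(), // call when the palette is shown
    stop: () => recognition.stop(), // call when the palette is closed
  };
}
```

The palette would call `start()` when it opens and `stop()` when it closes.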
The format of the data given to you by the events takes a bit of figuring out, but that’s not too hard.
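For reference, the shape is roughly this: the event carries a `results` list, each result is itself a list of alternatives (best guess first), each alternative has a `transcript` and a `confidence`, and each result has an `isFinal` flag. A small sketch of pulling the text out, with `collectTranscript` being a name I made up:

```javascript
// Splits the text in a result event into finished and in-progress parts.
// Uses index access because the result lists are array-like objects.
function collectTranscript(event) {
  let finalText = "";
  let interimText = "";
  for (let i = 0; i < event.results.length; i++) {
    const result = event.results[i];
    const best = result[0]; // first alternative is the top guess
    if (result.isFinal) {
      finalText += best.transcript;
    } else {
      interimText += best.transcript;
    }
  }
  return { finalText: finalText.trim(), interimText: interimText.trim() };
}
```

Interim text only matters if `interimResults` is switched on; otherwise every result arrives already final.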
The real difficulty for me was that I was writing this into the browser extension. This works by injecting a “content script” that runs the whole show.
This content script is “sandboxed”, which means it has access to the content of the page, and can change that. But it doesn’t have access to other JavaScript on the page. And, sadly, this means it doesn’t have access to the speech recognition API, as this is a property of the window object.
The workaround for this is to inject a <script> tag into the page with the code that needs to use the voice recognition. BUT… this is now in a different sandbox to the command palette code. So some communication is needed between the palette and the speech recognition.
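The injection step might look something like the sketch below. The function name is hypothetical, and it assumes the voice code lives in its own file inside the extension (I've called it `voice.js` here purely for illustration) that the page is allowed to load.

```javascript
// Adds a <script> tag to the page so the code in `src` runs alongside
// the page's own JavaScript rather than in the content script's sandbox.
// `injectPageScript` is a made-up name for this sketch.
function injectPageScript(doc, src) {
  const script = doc.createElement("script");
  script.src = src;
  // The tag is no longer needed once the code has run.
  script.onload = () => script.remove();
  (doc.head || doc.documentElement).appendChild(script);
  return script;
}

// In a Chrome extension content script, assuming a file named voice.js
// that is listed under web_accessible_resources in the manifest:
//   injectPageScript(document, chrome.runtime.getURL("voice.js"));
```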
This needed some custom JavaScript events to be triggered and listened for.
Nothing too taxing, but the architecture is a bit weird.
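To give a flavour of the event plumbing, here is a minimal sketch of one direction: the injected script announcing recognized text, and the palette side listening for it. The event name and both function names are made up for this example. Both sides share the same page, so they can use a common target such as `document`, and I've kept the payload to a plain string to keep things simple.

```javascript
// Hypothetical event name; any unique string would do.
const TRANSCRIPT_EVENT = "turbo-admin-voice-transcript";

// Injected-script side: announce text that was just recognized.
function publishTranscript(target, text) {
  target.dispatchEvent(new CustomEvent(TRANSCRIPT_EVENT, { detail: text }));
}

// Palette side: react to recognized text.
// Returns a function that stops listening.
function onTranscript(target, handler) {
  const listener = (event) => handler(event.detail);
  target.addEventListener(TRANSCRIPT_EVENT, listener);
  return () => target.removeEventListener(TRANSCRIPT_EVENT, listener);
}

// In the browser both sides would pass `document` as the target:
//   onTranscript(document, (text) => { /* fill the palette input */ });
//   publishTranscript(document, "plugins");
```

The other direction (palette opened or closed, so start or stop listening) would work the same way with a second pair of event names.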
If I added this to the plugin version of Turbo Admin this wouldn’t be a problem. BUT… fun fact… the plugin version of Turbo Admin uses (almost) identical JavaScript – the only real difference is how it’s added to the page and initialized. So it should be trivial to port this to the plugin.
So… err… what now?
Honestly, I made this as a toy. A developer plaything. Just because I could. I didn’t think it had any real practical value.
When I shared it on Twitter, the initial reactions backed up my self-amusement and internal amazement at what modern tech can do.
But then people started talking about accessibility:
I had honestly never thought about this. But perhaps this combination of tech does have some real uses?
What do you think? Should I work on it some more? Build it into the plugin? Let me know.