It’s time to let the cat out of the bag. I announced about a month ago that I was working on a really cool project, and now I’m finally going to talk about it. From the title, it’s pretty obvious that the project involves voice recognition and media libraries. Well, here we go!
I worked at a company called Tellme this summer, and all of their technology revolves around voice recognition. After working on voice applications for them, I thought it’d be really cool to extend iTunes to take voice commands. I talked to some of my fellow interns about third-party voice recognition software that I could use, and got started.
First, let me talk about some of the software I’m using. My voice recognition tool is called Sphinx. It’s an open source Java package from Carnegie Mellon that’s really easy to extend. Sphinx was a bit difficult to install, but once it’s set up, it comes with a lot of demos that you can look through and build on. I took their “hello world” demo and added a bunch of stuff to it to get my media library recognition to work. If you’re at all interested (and because you’re still reading, I take it that you are), I strongly recommend checking out their software, playing with it, and extending it to build your own applications.
On the other side of the application, I’m using a Perl module to control iTunes. The module’s functions just output some AppleScript, so unfortunately this project only runs on OS X right now, but it’s a really simple, easy-to-use interface to AppleScript and therefore iTunes. This is what I really love about Perl: there are so many libraries that you can do pretty much whatever you want just by installing a couple of modules onto your computer. I’m also using Perl for XML parsing and other file handling, while Java handles the user interface and the interaction with Sphinx.
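To give a rough idea of what “outputting some AppleScript” amounts to, here is a minimal sketch of the one-line scripts involved. It is written in Java purely for illustration (the project does this part through the Perl module), and the class and method names are made up for the example. The AppleScript commands themselves (`play`, `pause`, `next track`) and the `osascript -e` command-line tool are standard on OS X.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: not the Perl module's API, just an illustration of
// the kind of AppleScript that ends up being sent to iTunes.
class AppleScriptCommands {

    // Maps a simple command word to an AppleScript one-liner for iTunes.
    static String scriptFor(String command) {
        switch (command) {
            case "play":
                return "tell application \"iTunes\" to play";
            case "pause":
                return "tell application \"iTunes\" to pause";
            case "next":
                return "tell application \"iTunes\" to next track";
            default:
                throw new IllegalArgumentException("unknown command: " + command);
        }
    }

    // Builds the argument list for osascript, the OS X tool that runs
    // AppleScript; the -e flag passes the script text inline.
    static List<String> osascriptCommand(String command) {
        return Arrays.asList("osascript", "-e", scriptFor(command));
    }
}
```

Passing the list from `osascriptCommand("pause")` to a `ProcessBuilder` on a Mac would pause playback; on any other system the list can still be built and inspected, but there is nothing to run it.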
OK, so what exactly does my prototype do? Essentially, I have a Java program that starts up the Sphinx recognizer, waits for the user to say something like “itunes play”, “itunes pause”, “itunes next”, or “itunes select”, processes the command, and sends it to iTunes. “Play”, “pause”, and “next” are self-explanatory, but “select” is a little more intricate. When the user says “itunes select,” the program prompts for an artist name and then a track name by that artist, searches the iTunes library, and plays that specific track. That’s the program from a high level.
Looking deeper, the Java code is just a loop that waits for recognition, but it uses a different grammar depending on what it’s waiting for. The main loop has a grammar that accepts only the “iTunes” commands. Each command produces a different string, which I pass as the argument to a Perl script. The Perl script then uses the AppleScript library to send the command to iTunes.
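The hand-off from recognized phrase to Perl script could look roughly like the sketch below. Everything named here is an assumption for illustration: the script name `itunes_control.pl`, the argument strings, and the class itself are hypothetical, and the Sphinx side is left out entirely so the example only deals with the text the recognizer hands back.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical dispatcher: turns recognized text into a Perl script call.
class CommandDispatcher {

    // Recognized phrase -> argument string passed to the Perl script.
    private static final Map<String, String> COMMANDS = new HashMap<>();
    static {
        COMMANDS.put("itunes play", "play");
        COMMANDS.put("itunes pause", "pause");
        COMMANDS.put("itunes next", "next");
        COMMANDS.put("itunes select", "select");
    }

    // Returns the script argument for a recognized phrase, or null if the
    // phrase is not one of the known commands.
    static String toArgument(String recognizedText) {
        if (recognizedText == null) {
            return null;
        }
        return COMMANDS.get(recognizedText.trim().toLowerCase(Locale.ROOT));
    }

    // Builds the command line for the (assumed) Perl script. Extra values,
    // such as an artist and a track name, are appended as further arguments.
    static List<String> buildCommandLine(String argument, String... extraArgs) {
        List<String> commandLine =
                new ArrayList<>(Arrays.asList("perl", "itunes_control.pl", argument));
        commandLine.addAll(Arrays.asList(extraArgs));
        return commandLine;
    }

    // Launches the script and waits for it to finish; returns its exit code.
    static int run(List<String> commandLine) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(commandLine).inheritIO().start();
        return process.waitFor();
    }
}
```

Passing the arguments as a list to `ProcessBuilder`, rather than gluing them into one shell string, means artist and track names with spaces or quotes in them reach the script intact.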
Prior to loading the recognizer, I use Perl’s XML libraries to parse “iTunes Music Library.xml,” the file that stores all of iTunes’ track information. Then when the user says “itunes select,” I dynamically generate a grammar composed of a list of all the artists in the music library. When the user says an artist name, I dynamically generate a second grammar of all the song titles by that artist. Once I have the artist name and the song name, I pass the information into the Perl script, which again sends the commands to iTunes. On the whole, the project doesn’t seem too complicated, but there’s quite a bit of code involved, and it’s definitely more challenging than any of the other projects I’ve worked on.
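Generating a grammar on the fly is mostly string building. The sketch below assumes the grammar is written in JSGF, the plain-text format used by the Sphinx demo grammars, and that names are simplified to lowercase words before they go in. The class name, rule names, and the normalization rules are all assumptions for the example, not the project’s actual code.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper: builds a JSGF grammar whose single public rule
// accepts any one of the given names (artists or track titles).
class GrammarBuilder {

    // Lowercases a name and strips punctuation so that only letters,
    // digits, and single spaces remain, e.g. "AC/DC" becomes "ac dc".
    static String normalize(String name) {
        return name.toLowerCase(Locale.ROOT)
                .replace("'", "")
                .replaceAll("[^\\p{L}\\p{Nd} ]", " ")
                .trim()
                .replaceAll("\\s+", " ");
    }

    // Returns the full grammar text: a header, the grammar name, and one
    // public rule listing every distinct name as an alternative.
    static String buildGrammar(String grammarName, String ruleName, Collection<String> names) {
        Set<String> phrases = new TreeSet<>(); // sorted, duplicates removed
        for (String name : names) {
            String phrase = normalize(name);
            if (!phrase.isEmpty()) {
                phrases.add(phrase);
            }
        }
        if (phrases.isEmpty()) {
            throw new IllegalArgumentException("no usable names for grammar " + grammarName);
        }

        List<String> alternatives = new ArrayList<>();
        for (String phrase : phrases) {
            alternatives.add("(" + phrase + ")");
        }

        StringBuilder grammar = new StringBuilder();
        grammar.append("#JSGF V1.0;\n\n");
        grammar.append("grammar ").append(grammarName).append(";\n\n");
        grammar.append("public <").append(ruleName).append("> = ");
        grammar.append(String.join(" | ", alternatives));
        grammar.append(";\n");
        return grammar.toString();
    }
}
```

Calling `buildGrammar("artists", "artist", artistNames)` once for the artist list and again with the chosen artist’s track titles gives the two grammars the “select” flow needs. A grammar like this only helps if the recognizer knows how to pronounce every word in it.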
My current prototype works as I just described, but it’s certainly not complete. Before I actually publicize and maybe distribute my software, I want to fully integrate it into iTunes as a plug-in. I also need to improve recognition, because my ultimate goal is to leave the plug-in running all the time, and for that it has to know when it should be listening and when it shouldn’t. Finally, I want to let the system learn artist and track names, because as of now it only understands legitimate English words, and a lot of artists’ names aren’t English words. I want to build a feature that lets you train the system when you first run it, so that it learns these names and recognizes them on later runs.
Of course, all of my aspirations and additions are a lot more challenging than just getting my prototype running, but I really think this project is cool and worth spending my time on. It would be amazing to show off to my friends, peers, and, of course, recruiters. What’s more important, though, is that I’ll learn a lot about building a stable, user-friendly product that integrates a lot of different technologies. I hope to take this project from start to finish, spending time on all aspects of product development. It’s a challenge, but it’s been fun so far, and it’ll be so worth it when it’s complete.