First of all, I’m going to be switching this blog over to WordPress very soon. My own framework is just too much maintenance, and it takes me an unnecessary amount of time to add the features I want. With WordPress everything will look a lot nicer and work a lot better, and it’ll just make things easier for me. I’ve gotten to the point where I’d rather focus on writing than on site maintenance and debugging.
The other main reason for doing this is that I’d like to make an add-on for WordPress that tries to automatically generate reasonable tags for the writer’s entry. I’m switching over so that I can test it on WordPress. I would do it on my own framework, except I never really finished my tagging system and I don’t plan to, so I have nowhere to test. It’ll be a lot easier to build on top of a more robust system like WordPress and not have to worry about all of the details they’ve already taken care of, so that I can focus on the project at hand.
I came up with the idea for this project after talking to a company called Metaweb at an internship fair on campus. They’ve built an online queryable database of the world’s information (or some of the world’s information) so that developers can build applications on top of it and take advantage of all of the nicely structured data. The database is called Freebase. It’s fun to play around with, but I also see it as pretty useful for building applications that need information from external sources (like the automated tagger).
So I was playing around with Freebase, and I thought that it would be really cool if I could automate tag generation on my blog (mostly because I usually forget to tag posts anyway). With this structured data, I thought that maybe I could take advantage of the tagging that already takes place on Freebase, and all I’d have to do is figure out which tags to take from there. I could just look for keywords in the post, send queries to Freebase, look at the significant tags that come back, and suggest some of them for the post. Sounds pretty straightforward.
The general plan of attack is: keep a postings file (just a word -> entry mapping) of all the significant words in all of a user’s posts. Then, when a new post is published, look at its significant words and find the ones that seem like good candidates for tags (via some term-weighting scheme like TF-IDF). Then, for each candidate, look up its Freebase entry and pull out the tags attached to it there. Suggest these tags (or some subset of them) as tags for the article, and let the user choose which of the suggested tags he or she likes.
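To make the first two steps a bit more concrete, here is a minimal Python sketch of the postings file and the TF-IDF ranking. Everything in it is an assumption made for illustration: the `PostingsIndex` class, the `tokenize` helper, the tiny stop-word list, and the smoothed IDF formula are all made up here and are not part of WordPress or Freebase. The Freebase lookup itself is left out, since I haven’t worked out what those queries would look like.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "it",
              "i", "for", "on", "with", "that", "this", "my", "from"}


def tokenize(text):
    """Lowercase the text and keep alphabetic words that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


class PostingsIndex:
    """Postings file: maps each significant word to the posts that contain it."""

    def __init__(self):
        self.postings = defaultdict(set)  # word -> set of post ids
        self.num_posts = 0

    def add_post(self, post_id, text):
        """Record every distinct significant word of an already published post."""
        for word in set(tokenize(text)):
            self.postings[word].add(post_id)
        self.num_posts += 1

    def candidate_words(self, text, top_n=5):
        """Rank the words of a new post by TF-IDF against the indexed posts."""
        counts = Counter(tokenize(text))
        total = sum(counts.values())
        scores = {}
        for word, count in counts.items():
            tf = count / total  # how often the word appears in this post
            df = len(self.postings.get(word, ()))  # how many old posts use it
            # Smoothed IDF (an assumption): words that are rare across the
            # blog score higher, and unseen words do not divide by zero.
            idf = math.log((1 + self.num_posts) / (1 + df)) + 1
            scores[word] = tf * idf
        # Highest score first; ties are broken alphabetically.
        return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]
```

The idea would be to call `add_post` once for each existing entry, then call `candidate_words` on a draft and send only the top few words on to the external lookup. Words that show up in nearly every post get a low score, which is roughly the "is this word significant for this user?" question.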
Obviously this is a pretty high-level plan of attack, and I’ve been told that it’s not going to be that easy, but I certainly think it would be cool. It also occurred to me that it doesn’t have to be Freebase that I’m querying. Maybe I could just look at previous posts from the same blogger that contain a similar set of candidate words and reuse their tags. There are definitely variations of this that may end up being better solutions, but I’ll play around with all of that when I actually start working on it.
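The previous-posts variation could look something like the sketch below. It is only one possible reading of the idea: the function name `suggest_from_history`, the shape of `previous_posts`, and the choice of Jaccard overlap as the similarity measure are all assumptions for illustration.

```python
from collections import Counter


def suggest_from_history(candidates, previous_posts, top_n=3):
    """Suggest tags by reusing the tags of similar earlier posts.

    candidates: the candidate words of the new post.
    previous_posts: list of (set_of_words, list_of_tags) pairs, one per
        earlier post (a hypothetical format chosen for this sketch).
    """
    candidates = set(candidates)
    votes = Counter()
    for words, tags in previous_posts:
        union = candidates | words
        if not union:
            continue
        # Jaccard overlap: shared words divided by all words in either set.
        overlap = len(candidates & words) / len(union)
        if overlap == 0:
            continue
        # Each earlier post votes for its own tags, weighted by similarity.
        for tag in tags:
            votes[tag] += overlap
    ranked = sorted(votes.items(), key=lambda kv: (-kv[1], kv[0]))
    return [tag for tag, _ in ranked[:top_n]]
```

Because the suggestions can only come from tags that already exist on the blog, this version would never introduce a new tag on its own, so it would probably need to be combined with an external source for posts on new topics.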
I see this as useful because not only would my tags be thought up for me (which is very convenient), but it could also keep my tags consistent with each other (I wouldn’t end up with separate tags for “photos” and “photography”, for example). That would make a blog much easier to navigate, since there would be fewer tags and stories would be categorized better. In these respects I think it’s a pretty worthwhile project to take on.
I also think it’s more challenging than the stuff I usually do. It deals with text search algorithms, efficient and appropriate data storage, and a fair amount of artificial intelligence (as in “is this word significant based on what this user has previously written about and what he’s writing about in this article?”). Fortunately, I’m currently taking a databases course where I’ve already learned about text search and, of course, data storage. I’m also planning to take an AI course next semester where I may learn some concepts that I can put to use in this project. Again, I haven’t really spent much time on the project yet, apart from downloading WordPress and going through some of the code, and I may never get around to it if something more important comes up. Still, I think it’s a pretty cool project that I would enjoy working on. Hopefully I’ll find some time to get it done soon.