## Archive for the ‘cs’ Category

### A Very Simple Explanation of Spectral Clustering

Friday, March 16th, 2012

I’ve been wanting to write a post about some of the research I’ve been working on, but I realized that my work has gotten quite technical and is not very accessible to someone outside machine learning. So, as a precursor, I thought I would write about spectral clustering at a very high level and with lots of pictures. I hope you can follow along. For a more technical introduction to spectral clustering, this tutorial is very nice and much more complete than this article.

Clustering

To start out, it is important to understand what the clustering problem is and why it is important. Clustering is an extremely useful tool in data analysis, whereby a set of objects is organized into groups based on similarity. We typically want to form groups so that objects in the same group are highly similar and objects in different groups are less similar. This is known as unsupervised learning because the data set is not labeled (differentiating it from tasks like regression and classification).

Many applications of clustering are more exploratory, in the sense that further analysis is done individually on the clusters discovered by the algorithm. Applications include image processing, bioinformatics, information retrieval and many, many others. In image processing, one application might be to identify all of the different objects in the image by clustering the pixels. Once pixels have been clustered, it becomes much easier to tag or label the objects. There are several applications in bioinformatics, but one example is identifying groups of genes that have similar expression patterns across conditions. Identifying groups of similar genes facilitates subsequent analysis, in part because these groups are much smaller than the original set. Another application is in networking, where grouping computers on a network based on packet travel times can be used as a preprocessing step to several network optimizations. I won’t delve too much into applications but I hope that you are convinced that this is an interesting and very important problem in data analysis.

Mathematically, the clustering problem can be thought of as follows: we are given a dataset of points (objects) $x_1, \ldots, x_n$ and a number $k$, and we would like to partition the dataset into $k$ groups (clusters) so that within-cluster similarity is high and between-cluster similarity is low. In some cases the points lie in a Euclidean (or metric) space, so we can use distances in place of similarities, except that now we want within-cluster distance to be low and between-cluster distance to be high. Getting from a specific application to this mathematical formulation is sometimes easy and sometimes less so. In the gene expression example, we can represent each gene as a vector of its expression levels in each of the conditions, so it is natural to think of the genes as living in a Euclidean space. In the networking example, it is more natural to consider a computer only in terms of its distances to other computers. Here there is no Euclidean structure, but we are still able to use a number of clustering algorithms.

As an example of the more abstract scenario, consider the dataset below, consisting of points in the plane drawn from two two-dimensional Gaussian distributions. One can see visually that the clustering in the second figure is a good one, and this matches our intuition that the points from one Gaussian should be in one cluster while the points from the other should be in a different cluster.

I won’t say much more about clustering in general, except that it is typically framed as a minimization problem for some cost function. For some examples and more discussion, please see this Wikipedia article.

Spectral Clustering

Spectral Clustering is a family of clustering algorithms with foundations in graph theory and linear algebra. I won’t dive too deeply into these foundations in this article; rather, I hope to give some intuition as to why this kind of algorithm makes any sense.

First, what is the algorithm? I’ll explain one simple variant here. Given our data points $x_1, \ldots, x_n$, we construct a graph on the $n$ objects where two objects are connected by an edge if they are sufficiently similar. Going with our example from above (a different set of data, but again two clusters from two-dimensional normal distributions), if we were to add an edge between every pair of objects $x_i, x_j$ whose distance is less than 1.5, we would obtain the following graph:

For every graph of this form, we can construct an $n \times n$ matrix $M$, called the adjacency matrix, where $M_{ij} = 1$ if there is an edge between $x_i$ and $x_j$ and $M_{ij} = 0$ otherwise (for simplicity, set $M_{ii} = 1$ for all $i$). In spectral clustering, we look at the eigenvalues and eigenvectors of this matrix (or a related matrix called the Laplacian) and use these quantities for clustering.
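
As a concrete illustration, here is a minimal sketch of this construction in Python (numpy and scipy assumed; the data is a hypothetical stand-in for the two-Gaussian example, not the actual dataset from the figures):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical data: n points in the plane, drawn from two 2-D Gaussians.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),   # cluster 1
               rng.normal(5.0, 1.0, size=(20, 2))])  # cluster 2

# Connect two points whenever their Euclidean distance is below 1.5.
dists = squareform(pdist(X))
M = (dists < 1.5).astype(int)
# The diagonal of `dists` is 0, so M_ii = 1 automatically, matching the
# convention in the text.
```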

Why would the eigenvectors and eigenvalues of $M$ tell us anything meaningful about the clustering of the data? Consider a very simple example with two clusters, where the graph construction puts an edge between every pair of objects in the same cluster and no edges across clusters. In this case, the adjacency matrix $M$ of the graph is block diagonal and looks as follows (suppose all of the objects in the first cluster come before the objects in the second cluster, and suppose we have only 4 objects in this toy example):

$$\begin{pmatrix} 1 & 1 & 0 & 0 \\ 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & 1 & 1 \end{pmatrix}$$

It is easy to see that the top eigenvectors of this matrix (those with the nonzero eigenvalue $2$) are $(1\ 1\ 0\ 0)^T$ and $(0\ 0\ 1\ 1)^T$. Taking the first of these, the coordinates where the eigenvector equals $1$ correspond exactly to the items in the first cluster (and likewise, the second eigenvector identifies the second cluster). If I were to permute the matrix by swapping rows and columns, to something like:

$$\begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \\ 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix}$$

then the top eigenvectors are $(1\ 0\ 1\ 0)^T$ and $(0\ 1\ 0\ 1)^T$, which again exactly identify the clustering (here the first and third objects are in one cluster and the second and fourth are in the other). This is the simple justification for why spectral clustering makes any sense: the eigenvectors of these matrices make it really easy to identify the clusters.
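
You can check this numerically; here is a small sketch (numpy only) for the block-diagonal matrix above:

```python
import numpy as np

# The block-diagonal adjacency matrix from the text.
M = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=float)

vals, vecs = np.linalg.eigh(M)  # eigenvalues in ascending order
print(vals)                     # [0. 0. 2. 2.]
print(vecs[:, -2:])             # a basis for the eigenvalue-2 eigenspace
# Because the eigenvalue 2 is repeated, eigh may return any orthonormal
# basis of its eigenspace (e.g. multiples of (1 1 0 0)^T and (0 0 1 1)^T,
# or some rotation of them). Either way, every vector in this eigenspace
# is constant within each cluster, so the rows of vecs[:, -2:] coincide
# for objects in the same cluster -- which is what makes the clusters
# easy to read off.
```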

I just argued that spectral clustering would work for a very contrived example and I hope that you are feeling dissatisfied by this. Some questions you might have are: What happens if not every pair of objects from the same cluster shares an edge (as in the figure above)? What happens if there are edges between the two clusters? What happens if we are looking for more than two clusters?

It turns out that it isn’t a big deal if the clusters do not form completely connected subgraphs. In fact, it is sufficient that each cluster is connected (there is a path between every pair of nodes in the cluster, not necessarily an edge) and disconnected from the other clusters for the eigenvectors to exactly identify the clusters. However, to make this claim we have to use the Laplacian matrix rather than the adjacency matrix. A proof is in the tutorial linked at the top of this article; I won’t go into the details here, but please see the tutorial if you are interested. With this result, you should be convinced that spectral clustering on the graph in the third figure will recover the clustering in the second figure (which, as we said, is the correct clustering).
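
For reference, the (unnormalized) Laplacian here is $L = D - M$, where $D$ is the diagonal degree matrix with $D_{ii} = \sum_j M_{ij}$. The key fact is that the multiplicity of the eigenvalue $0$ of $L$ equals the number of connected components of the graph, and the corresponding eigenspace is spanned by the indicator vectors of those components — precisely the “eigenvectors identify the clusters” property from the toy example.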

If there are between-cluster edges, then in general it’s not obvious that spectral clustering (or any clustering algorithm, for that matter) will work. However, as long as there are very few between-cluster edges in comparison to the number of within-cluster edges, there is a reasonable high-level justification for the algorithm’s performance. Each spectral clustering algorithm has a connection to a graph cut problem. These graph cut problems look for a partitioning of the graph that minimizes the number of edges that have to be removed (usually with some restrictions on the sizes of the partition subsets). It is well known that spectral algorithms go after these graph cut objectives; specifically, they solve relaxations of these problems, which are themselves hard to solve exactly. If you believe this argument, then you should at least be convinced that if the smallest cut is easy to identify (i.e., the number of between-cluster edges is very small), then spectral clustering should be successful.
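
To make this slightly more concrete, one standard objective of this type is RatioCut: over partitions $A_1, \ldots, A_k$ of the vertices, minimize $\sum_{i=1}^k \mathrm{cut}(A_i, \bar{A}_i) / |A_i|$, where $\mathrm{cut}(A, \bar{A})$ counts the edges leaving $A$. Minimizing RatioCut exactly is NP-hard, but relaxing the cluster indicator vectors to arbitrary real-valued vectors turns it into exactly the problem of computing the bottom $k$ eigenvectors of the Laplacian; this derivation is worked out in the tutorial mentioned above.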

Extending spectral algorithms to handle more than two clusters is actually somewhat challenging. The most popular strategy is to take the first $k$ eigenvectors (those with the largest eigenvalues of the adjacency matrix or, if one uses the Laplacian, the smallest eigenvalues), represent each object as a $k$-dimensional vector using the eigenvectors, and run a different clustering algorithm in the $k$-dimensional space. If the eigenvectors are $v_1, \ldots, v_k$, each in $\mathbb{R}^n$, then we represent the object $x_i$ as $(v_1(i), \ldots, v_k(i))^T$. The high-level justification (again appealing to the contrived case) is that the eigenvectors separate the clusters very nicely, and representing each object in terms of the eigenvectors makes it very easy for other clustering algorithms to succeed. As an example, consider the block-diagonal case again: all objects from the same cluster get mapped to the exact same point (and objects from different clusters get mapped to different points), so virtually any clustering algorithm will work.
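
Putting the pieces together, here is a minimal sketch of this $k$-cluster variant in Python (numpy, scipy, and scikit-learn assumed; the epsilon-threshold graph and the use of k-means in the embedded space are one common set of choices, not the only ones):

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import KMeans

def spectral_clustering(X, k, threshold=1.5):
    """Cluster the rows of X into k groups via the graph Laplacian."""
    # Epsilon-neighborhood graph: connect points at distance < threshold.
    dists = squareform(pdist(X))
    adj = (dists < threshold).astype(float)  # diagonal is 1 (M_ii = 1)
    # Unnormalized Laplacian L = D - M, with D the diagonal degree matrix.
    lap = np.diag(adj.sum(axis=1)) - adj
    # Bottom k eigenvectors of the Laplacian (eigh sorts eigenvalues in
    # ascending order); for the adjacency matrix one would take the top k.
    _, vecs = np.linalg.eigh(lap)
    embedding = vecs[:, :k]  # row i is the new representation of x_i
    # Run an off-the-shelf clustering algorithm in the embedded space.
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)
```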

Conclusion

If you made it this far, I hope that you have some basic understanding of what the clustering problem is and why looking at the eigenvectors of a graph adjacency matrix is a reasonable thing to do for clustering. I only gave a really simple example to convince you that this algorithm makes sense, and I didn’t even describe the algorithm in full detail, but maybe I gave you some insight into why these methods work. Without getting into a lot of linear algebra, it’s quite hard to explain what is really going on with spectral clustering, but if you are prepared for that challenge I highly encourage you to read the tutorial I mentioned earlier.

My primary motivation for writing this is so that I can explain some of the things that I have been doing for the last year or so. The main idea of our work is to prove that spectral clustering will perform well on a large family of “clusterable” graphs. These graphs have identifiable clusters, but are corrupted with a lot of noise, which makes it challenging to recover the clusters. Our results show that the eigenvectors are stable under noise, so that spectral clustering still performs well while other algorithms may not. More on this in a later post.

### The Cost of Knowledge

Sunday, February 5th, 2012

The fight against academic publishers is heating up as Tyler Neylon’s website, The Cost of Knowledge, continues to gain support against Elsevier. If you haven’t heard, the website is a place where you can publicly declare that you will boycott Elsevier, one of the academic publishers with particularly terrible practices. It may have been created in response to Timothy Gowers’ public boycott declaration, and it is supported by him and many other famous scientists and mathematicians. As of today there are 3867 total signatories, 544 of them in computer science alone.

First off, why is Elsevier (and most other academic publishers) so evil? In a nutshell, they exploit the work of academics (funded by taxpayers) to turn incredible profits without adding much value. Journals are run by volunteer editors (academics), papers are reviewed by volunteer reviewers (academics) and papers are written by academic researchers. Moreover, researchers are expected to prepare “camera-ready” versions of their papers, which makes the paper almost entirely ready for publication. Publishers charge exorbitant subscription fees for their journals, but their costs are minimal and their value-add is effectively non-existent.

But publishing companies effectively have a monopoly on the top journals that academics need to publish in to advance their careers. Alternative publishing venues haven’t caught on because publishing there doesn’t carry the same weight as publishing in elite journals like Nature and Science. The fact is that academia, even as it pushes the boundaries of knowledge, is very conservative about accepting change. Thus, despite the fact that several alternatives have been proposed (e.g. this, this, and maybe even the arXiv, along with several more abstract proposals), academia has been slow to adopt alternative venues and media for publication.

Movements like “The Cost of Knowledge” are designed to combat the inherent inertia in academia, in hopes that we can converge on a better method of publication. Once academics realize that many of their colleagues are boycotting Elsevier/Springer/etc., it will become much more reasonable for them to boycott as well. And once the majority of a field boycotts one of these companies, either alternative publishing venues will gain credibility, or the company will be forced to change its policies and pricing or risk going under.

To me, the only issue is that this movement has to involve academic institutions as a whole in addition to individual researchers. Institutions use the impact factor of journals as a surrogate for research quality and use this metric in hiring and tenure decisions. Until this changes, young, untenured researchers are going to be reluctant to boycott publishing companies that run elite journals because of the career implications of doing so. This is probably one of the primary reasons why I haven’t joined the boycott yet.

The public boycott does have some interesting side-effects. First, the fact that the boycott is public and supported by top researchers means that it is more likely to gain traction. The fact that there is a list of elite researchers who are boycotting may influence how institutions make hiring decisions, which could kick start a positive feedback loop resulting in a much more powerful boycott. A more indirect effect is that top researchers are now boycotting elite journals, meaning that the quality of those journals will decline. This might force institutions to rethink how they make hiring decisions while also enabling alternative publishing media to flourish.

Whether or not the boycott is successful, enough people are up in arms about publishing (in the blogosphere and elsewhere) that it finally seems academics have the traction to prompt some sort of change in the academic publishing system. Hopefully we’ll see some positive changes in the next couple of years.

### Some thoughts on Academic Publishing

Saturday, November 26th, 2011

The Problem

If you are familiar with the problem, you might want to skip down to the next section where I talk about some proposals that I and others have thought about.

To start out: what is wrong with the current system? It is actually quite complicated, but the main idea is that publishers (Elsevier, Springer, etc.) no longer seem to add any value while continuing to charge both authors and readers exorbitantly. Traditionally, the role of the publisher was to aid in the distribution of academic material, and when this was legitimately a service, I completely understand them charging for it. However, now that almost all content can be obtained electronically, the role of distributor is no longer necessary. Yet publishers continue to charge ridiculous fees for journal subscriptions, which are required for an institution to obtain even electronic access to journal articles. I remember reading somewhere that university libraries spend the majority of their budget on journal subscriptions.

So why do researchers continue to publish in these journals? It is well known that academia is instilled with a publish-or-perish mentality; moreover, the specific venue in which you publish influences how your peers regard your work. Journals are scored by impact factor, and publishing in high-impact-factor journals signals that I am a good researcher. The quality of the journals in which I publish plays a significant role in hiring decisions and other career opportunities, and this, at least to me, is the primary reason why researchers continue to submit to these closed journals. There are other factors that motivate researchers to publish in journals, such as the peer-review system and the fact that publication is a sanity check that the work is correct and reasonable. However, I think the main motivation is to demonstrate one’s research ability. Noam Nisan talks about some other reasons and more details about this problem here.

To summarize, as it stands, the publishers provide no real value, but they restrict access to the elite journals. This motivates researchers to stick with their clearly flawed system. If we could introduce an open system to score and critique papers, with a mechanism for recognizing outstanding papers, it seems like we could break free of the existing system.

A Popular Solution

One popular solution is a combination of Reddit and the arXiv for academics. Researchers can post their papers online, and then other people can leave comments and reviews. Everyone has a reputation score, and the influence of one’s comments depends on one’s reputation. Papers could also be assigned scores: anyone can score a paper, but the weight of their score depends on their reputation. That way, on my CV I can list all of my papers along with their scores, so that others can quickly glance at my CV and get an idea of how important my research is. This is the basic idea, but obviously there are a lot of details to get right so that no one can game the system. I’ve spent some time thinking about this, and I think that if you implement it carefully enough you can make it work. Timothy Gowers also seems to think so, and he has thought about many of the details. If you are interested, please read his blog post, here.

One of the comments left on Timothy Gowers’ blog post is that we might not want to turn life into a game where reputation points mean everything. I really agree with this: some black box would be calculating my reputation on this website, and the score it outputs would have serious consequences for my career opportunities and life. It makes academic life too much like a game, where everything I do revolves around increasing the scores on my papers and increasing my reputation. So while I still think the system could work, it may not be what all academics want.

A Less Popular Proposal

Gowers briefly talks about another idea, or at least an extension of his existing proposal, that I think merits some additional discussion. The idea is this: anyone can start, edit, and curate their own online journal about whatever they want. They assemble a team of reviewers, who could be peers, friends, collaborators, or really anyone else they know. The editor of a journal and the team of reviewers are public information, and their reputation (not necessarily based on a scoring system) is what helps determine the quality of the journal. When I write a paper, I can submit it to one of these online journals, where it will go through the peer-review process and possibly be accepted. Submissions and reviews can potentially be done anonymously, to allow for double-blind reviewing. Acceptance into someone’s online journal is a stamp of approval for a paper, and on my CV I would list which online journals my papers were accepted into. As in the other system, once a paper is accepted somewhere, perhaps anyone should be free to comment on and score it.

There are several ways this system can account for journal quality/impact factor. A simple one is to use the editor’s and reviewers’ reputations as a proxy for the quality of the journal. Another is to allow journals to have reputation scores, based on the scores of that journal’s papers. This second solution presents a startup problem, but I think you could bootstrap by using the first solution until the journal has a substantial number of articles. Also note that this same problem arises when starting a traditional journal. Again, there are some details that need to be worked out, but I do believe this sort of system could be made to work.

As a sort of aside, the Journal of Machine Learning Research (JMLR) is an example of some of these ideas at work. The journal is open, providing free online access to all of its articles, and it still has a fairly high impact factor. In 2004, it apparently had the second-highest ISI impact factor of any computer science journal (source). This small-scale experiment suggests that this sort of idea might actually work.

In Conclusion

If anyone reads this, I’d be interested to know what you think about these proposals. Do you see any serious complications/problems? Do you have any alternative proposals?

### Choosing a problem

Wednesday, September 14th, 2011

Yesterday I had a discussion with two of my friends about why we, as researchers, choose to work on the problems we work on. In statistics and machine learning, and maybe more generally in computer science, this can be a pretty interesting question. Do we work on problems purely for personal interest? Or do we require that problems have some well-thought-out practical application?

This is a philosophical question that I think is fairly unique to computer science and machine learning. In disciplines like biology, there is no question that applications are necessary motivation for a research problem. On the other hand, in pure math, it seems that applications are independent of problem choice; it’s nice if research is practical in nature, but it’s not necessary. CS and ML research is interesting because in many respects it is only loosely motivated by application, and often it is more about developing theoretical results than demonstrating practicality. This is particularly true of the research I have been doing recently in statistical machine learning.

The discussion went something like this: person A advocates working on things because they are interesting to him, while person B advocates having some grounding in reality before delving into a research problem. Person B at first claimed that problems are interesting to the academic community only if they are practically motivated. As a counter, person A mentioned the centuries of mathematical results (e.g. in number theory) that only became “useful” years later; clearly, those researchers were motivated by interest. Person B then asked why person A worked at all, and person A responded that he enjoys the problems he works on, mostly because they are interesting. The three of us then started talking about this in more detail, with a focus on statistical machine learning problems.

Yesterday, we didn’t reach a conclusion on the right thing to do. I was tending toward choosing problems based on interest. Assuming you have some sense of what has been going on in the community, your view of what is interesting should at least somewhat match the community’s. If you feel a problem is interesting, my bet is that many other people feel the same way, and for that reason it is a reasonable problem to work on.

I chatted with person B again today about this subject, and he brought up two interesting points. First, it is unlikely that you will get funding to work on problems that are not practically motivated. While this is probably true later in life, graduate school is exactly for this purpose: you get paid to work on the things you find interesting, whether or not they’re well motivated. Even later on, I think there is an art to making your work sound relevant and convincing people that it is worth their funding, and even with grants, I believe you have some flexibility to work on problems that interest you.

His second point was that working on obscure problems that no one cares about is not productive, even if you find them personally interesting. For example, if you take a well-known problem, modify it slightly, and rederive results for the modification, it isn’t interesting unless the modification is well motivated. I agreed with him on this point; I don’t really consider it research to walk through calculations for subtly different problems. There has to be something novel about the modified problem. On the other hand, if you took an obscure problem and came up with an elegant, novel solution, that would be interesting and useful. Additionally, if you worked on an obscure problem but came up with intermediate results that could be applicable elsewhere (for example, new concentration inequalities), I would also say that is useful.

So after thinking about this for a couple of days, here is where I stand. A research problem is worth working on (read: I would work on it) if: (1) you find it interesting, (2) it is practically motivated, (3) you believe that other researchers will be interested in your results (whether or not they are practically motivated), (4) you believe there are intermediate consequences, lemmas, or results that will be widely useful, or (5) you believe there is something elegant about the problem and its solution (i.e. the solution is not mechanical). To me, (1) is necessary, and at least one of (2) through (5) is required.

### Doing Research Effectively

Friday, August 12th, 2011

I submitted a research paper to a conference (INFOCOM) a couple of weeks ago and have recently been trying to figure out what I want to take on as my next research project. It’s been a pretty rough couple of weeks, during which I wasn’t really sure what I should be doing, and I realized that this phase is what makes doing research challenging. In this post I want to talk a bit about my experiences over the last two weeks and some of my other thoughts on how to do research effectively. Since I work on the more theoretical side of machine learning, most of my account is tailored to research in this field, and some parts of it certainly won’t apply to other research areas (even within computer science).

Over the course of my last project, we came up with a bunch of ideas for small projects to work on after the submission, and that list was the first place I went for inspiration. Unfortunately, many of these ideas were incremental changes to my existing project, i.e. ways to relax certain assumptions, ways to make the guarantees slightly stronger, etc., and I didn’t think they would make for substantial projects. I also had a bunch of ideas from thinking about class projects for both of the machine learning classes I took last year. Some were terrible, some were interesting, but I didn’t find anything I was really excited about working on. I will emphasize that it is a really good idea to write all of these things down when they come up, so that you don’t forget about them later. The fact that I had written down almost all of the ideas I came up with over the last year was really helpful in finding a new project.

I’ve been working on problems related to network tomography, and to look for inspiration I read a bunch of the newer papers in that field, along with some papers discussing problems in statistical learning, which is another area I find really interesting. One of the ideas on my list was related to sparse coding, a fairly recent technique for finding simple representations of signals. I started reading a bunch of papers on sparse coding, keeping my idea in the back of my head, and eventually I figured out that my idea, as I had originally framed it months earlier, would not work. This was discouraging, but with some more thinking I found that a related formulation looked more promising. By promising I mean that I thought it would be interesting and that I believe it solves an open problem. I decided that I might want to start working on this idea.

Before diving into the theoretical side of any problem, I like to run some simple simulations to see if the idea makes sense. In this case, I found an implementation of a related algorithm and modified it to see if my idea would work on some fairly trivial inputs. I like to do this as a sanity check, so I don’t waste my time proving things about algorithms that don’t work in practice. This, I think, is fairly unique to machine learning, where the theory is mostly motivated by practice: no one will really care how strong your guarantees are if your method does not perform well, at least on simulated data. At any rate, I didn’t want to start thinking about the guarantees of my algorithm without making sure that it would work, and that’s why I ran some simple simulations.

This is more or less where I am right now with my project, and the next step is the hard part: a theoretical analysis of the method. I don’t really have any clue how to do this, and honestly I find that I make progress in this direction at the most random times, such as when I’m in the shower. This reminds me of something Manuel Blum told my algorithms class that I find really interesting. He said, “I’m interested in where ideas come from.” I, for one, am also interested in this, but have no idea about the answer.

I think I lost track of what I wanted to say in this article, so let me summarize and add a few new thoughts. When I’m evaluating whether a research idea is worthwhile, I tend to ask the following questions (in order):

1. Is it well-motivated? Are there real-world applications where you would want to do this?
2. Is it novel? Has anyone else done it or related things before? (literature search)
3. Is it interesting to me?
4. Does it work, at least on simple examples? (simulations)
5. Can you prove that it works? What are the assumptions/restrictions and what guarantees can you make? How do these compare to related work?

Usually, if I can answer the first four questions affirmatively, then I am willing to spend some time figuring out answers to the fifth. I went through these steps for most of the ideas I had written down over the course of the last year. Many of them didn’t make it past the first question; some involved a little more thought and a perusal of related work. This last idea has so far made it past the first four barriers; the next step is the analysis.