NP-Completeness in the workplace
Tuesday, July 7th, 2009My algorithms professor used to tell his students (including me) this story to motivate studying NP-complete problems and reductions. The story goes something like this: say you’re working as a software developer and your boss gives you this project where you need to come up with a efficient way to do some computation. You go to your desk and think about for awhile and then suspect that there is no efficient way to do the computation. So you can’t just go back to your boss and say, “I think this problem is NP-Hard, so I give up,” you need to show your boss that it’s NP-Hard and this motivates the studying of reductions. It further motivates the study of approximation algorithms and other techniques to cope with NP-Completeness.
I’ve heard this story a couple of times from my professor, but I never really felt like this is something that would happen. How often is it that you are given a problem that doesn’t look intractable immediately, and how much convincing will your boss really need? For that matter how often, in a typical job, do you encounter intractable problems? Like how often does this really happen at a typically development job? Now I haven’t spent too much time in industry and really in my time I never ran into NP-Hard problems so I don’t really know. Although the story was motivational for an algorithms course, the situation used to seem implausible to me. That is until I was challenged to do this myself
This summer, I’m working in a computational genomics lab refining some algorithms for analyzing gene expression data. The core problem we are working on is NP-Hard, but in my refinements, I ran into another NP-Hard problem that I did not immediately see as intractable. I guess this story may be a little bit skewed because I am basically doing algorithms research but I still thought it was pretty cool.
My work deals with finding connected components in a large gene network (an undirected graph). To simplify things, we have a scoring function and want to find connected components with good scores and because of the scoring function this turns out to be NP-Hard, read this for the exact problem and why it’s NP-Hard (The exact problem is a generalization of set cover). Anyway, the basic way we do this is to pick a vertex with a “good” score (called the “seed”) and then expand the module from that node.
In our experimentation, we decided to see what would happen if we allowed seeds to be multiple disconnected nodes and grew our modules from these multi-node seeds. Basically, here you choose k nodes to be your seed and you grow the module out from these seeds, but it is not a requirement that the result is a connected component, in this case it is ok if the resulting module consists of k connected components. The motivation was that from a biological perspective, a disease may affect two or more gene pathways and we wanted our algorithms to be able to account for that, hence allowing multiple connected components.
I set out to implement this feature and was thinking about ways to create these multi-node seeds. I started out by filtering away all of the “bad” vertices and was left with a set of 30-50 good vertices but it was (and still is) not clear what’s a good way to combine these single seeds into multi-node seeds.
My first approach was to simply enumerate all of the combinations and then choose the ones where the individual vertices had good scores. This approach besides having exponential time complexity was bad in that the seeds we selected were all very similar. This was because the individual nodes that were very good were in all of the multi-node seeds we selected. Thus this approach was not satisfactory. Another problem with this approach was that we typically did not see disjoint connected components in the resulting solutions.
To remedy the last problem, we decided on a new approach where we select multi-node seeds where each individual vertex is far from the rest. We compute this by looking at the neighborhoods (typically only 1 level of breadth first search) around each seed and computing the intersection of these neighborhoods. So this seemed like a good approach and we were hopeful this would produce disjoint connected components in the results so I set out to implement this.
Lets restate the problem: Given a set of vertices in a graph, where each vertex has an associated set of elements from some universe U. The goal is to find k vertices that minimizes the intersection of the sets, or such that the intersection of these sets is the empty set.
I thought about this problem for awhile and was quite confident that it was intractable. Before talking to the grad student I’m working with about it, I wanted to have at least a sketch of a reduction in mind. My first reaction was that I should be able to reduce set cover to this problem, possibly by “negating” all of the sets or something. I messed with this for awhile and I didn’t really get anywhere.
I mentioned this problem to some other people in my lab and we all thought about it for awhile. Eventually one guy found that he could reduce independent set to this problem quite easily. If you instead go back to the gene network graph and consider each of the “elements of the universe” as neighbors of the vertices in the graph, then you immediately see that this is the independent set problem. If you treat each neighborhood as a node in a graph, and there is a edge between two neighborhoods if they share an element in common, then you can solve this problem by running an independent set solver on this graph. Conversely, given a graph, you can define a universe of elements and a set to each vertex such that vertices that are not connected share no elements and vertices that are connected share some elements. Then you can look for multi-node seeds and solve independent set. The way I explained the problem above, and the way that I thought about it made it difficult to see that it was independent set, but if you go back to the original problem, it’s easier to see.
Now that I new the problem was NP-Hard, I started thinking about ways to get good enough multi-node seeds and eventually settled on a greedy algorithm that grows the seed sets by adding the node with minimal intersection. The resulting seeds are decent and we did start seeing solutions with multiple disjoint components which is exactly what we were looking for. Despite being faced with a “challenging” problem, we did end up finding a good working solution that gave us the results that we wanted.
This little anecdote, although quite silly in retrospect (because the reduction was pretty straightforward), really showed me that what I learned about reductions and NP-Completeness is important from a practical perspective. It also showed me that my professors story was actually quite plausible. Plus it was another situation where I got to apply things that I learned, and I always find that really cool.