Posts Tagged ‘research’

Some thoughts on Academic Publishing

Saturday, November 26th, 2011

With the wide-spread adoption of the internet, the traditional academic publishing system has become somewhat antiquated. This has caused a lot of uproar within the academic communities, and many prominent researchers have been thinking about alternative publishing systems. There’s a lot of material about this floating around the internet, but in this article I will outline some of my thoughts and ideas.

The Problem
If you are familiar with the problem, you might want to skip down to the next section where I talk about some proposals that I and others have thought about.

To start out: What is wrong with the current system? It is actually quite complicated, but the main idea is that publishers (Elsevier, Springer, etc.) no longer seem to be adding any value while continuing to exorbitantly charge both authors and readers. Traditionally, the role of the publisher was to aid in distribution of academic material, and when this was legitimately a service, I completely understand them charging for it. However, now that almost all content can be obtained electronically, the role of distributor is no longer necessary. Yet publishers continue to charge ridiculous fees for journal subscriptions, which are required for an institution to obtain even electronic access to journal articles. I remember reading somewhere that university libraries spend the majority of their budget on journal subscriptions.

So why do researchers continue to publish in these journals? Well it is well known that academia is instilled with a publish-or-perish mentality, and moreover the specific venue in which you publish influences how your peers regard your work. Journals are scored by impact factor and publishing in journals with high-impact factor indicates that I am a good researcher. The quality of journal in which I publish plays a significant role in hiring decisions and other career opportunities and this, at least to me, is the primary reason why researchers continue to submit to these closed journals. There are some other factors, that motivate researchers to publish in journals, such as the peer-review system and the fact that publication is a sanity check that the work is correct and reasonable. However, I think the main motivation is to demonstrate one’s research ability. Noam Nisan talks about some other reasons and more details about this problem here.

To summarize, as it stands, the publishers provide no real value, but they restrict access to the elite journals. This motivates researchers to stick with their clearly flawed system. If we could introduce an open system to score and critique papers, with a mechanism for recognizing outstanding papers, it seems like we could break free of the existing system.

A Popular Solution
One popular solution to do this is a combination of Reddit and the ArXiv for academics. Researchers can post their papers online and then other people can leave comments and reviews of the paper. Everyone has a reputation score and the influence of one’s comments depends on their reputation. Maybe papers can get assigned scores, so anyone can score the paper, but the weight of their score depends on their reputation. That way, on my CV I can write down all of my papers along with their scores, so that others can quickly glance at my CV and get an idea of how important my research is. This is the basic idea but obviously there are a lot of details so that one cannot game the system. I’ve spent some time thinking about this and I think that if you implement it carefully in enough you can make it work. Timothy Gowers also seems to think so and he has thought about many of the details. If you are interested, please read his blog post, here.

One of the comments left on the Timothy Gowers’ blog post is that we might not want to turn life into a game, where reputation points mean everything. I really agree with this; some black-box is calculating my reputation on this website and the score output by this has serious consequences on my life in terms of career opportunities, etc. It makes academic life too much like a game, where everything I am trying to do revolves around increasing scores on my papers and increasing my reputation. So while I still think the system could work, it may not be what all academics want.

A less popular proposal
Gowers briefly talks about another idea, or at least an extension to his existing proposal that I think merits some additional discussion. The idea is this: anyone can start, edit, and curate their own online journal about whatever they want. They assemble a team of reviewers, who could be peers, friends, collaborators, or really anyone else they know. The editor of a journal and the team of reviewers is public information, and their reputation (not necessarily based on a scoring system) is what helps determine the quality of the journal. When I write a paper, I can submit it to one of these online journals, where it will go through the peer-review process, and possibly be accepted. Submissions and reviews can potentially be done anonymously, to allow for double-blind reviewing. Acceptance into someone’s online journal is a stamp of approval of a paper, and on my CV I would list which online journals my papers were accepted into. As in the other system, once a paper is accepted somewhere, maybe anyone should be free to comment on and score it.

There are several ways this system can account for journal quality/impact factor. A simple one is to use the editor’s and reviewers’ reputations as a proxy for the quality of the journal. Another is to allow journals to have reputation scores, based on the scores of that journal’s papers. This second solutions presents a startup problem, but I think you could bootstrap by using the first solution until the journal has a substantial number of articles. Also note that this same problem arises when I want to start a real journal. Again there are some details that need to be worked out but I do believe this sort of system could be made to work.

As a sort of aside, Journal of Machine Learning Research (JMLR) is an example of some of these ideas at work. The journal is open, providing free online access of all of its articles, and it still has a fairly high impact factor. In 2004, apparently it had the second ISI impact factor of any computer science journal (source). This small-scale experiment suggests that this sort of idea might actually work.

In Conclusion
If anyone reads this, I’d be interested to know what you think about these proposals. Do you see any serious complications/problems? Do you have any alternative proposals?

Choosing a problem

Wednesday, September 14th, 2011

Yesterday I had a discussion with two of my friends about why we, as researchers, choose to work on the problems we work on. In statistics and machine learning, and maybe more generally in computer science, this can be a pretty interesting question. Do we work on problems purely for personal interest? Or do we require that problems have some well-thought-out practical application?

This is a philosophical question that I think is fairly unique to computer science and machine learning. In disciplines like biology, there is no question that applications are necessary motivation for a research problem. On the other hand, in pure math, it seems like applications are independent of problem choice; it’s nice if research is practice in nature, but it’s not necessary. CS and ML research is interesting, because in many respects it is only loosely motivated by application, and often it is more about developing theoretical results rather than demonstrating practicality. This is particularly true of the research that I have been doing recently in statistical machine learning.

The discussion went something like this: person A advocates working on things because they are interesting to him, while person B advocates having some grounding in reality before delving into a research problem. Person B at first claimed that the problems are interesting to the academic community only if they are practically motivated. As a counter, person B mentions the centuries of mathematical results (i.e. number theory, etc.) that only years later become “useful.” Clearly, these researchers were motivated by interest. Person B then asked why person A worked at all, and person B responded that he enjoys the problems he works on, mostly because they are interesting. The three of us then started talking about this in more detail, with focus on statistical machine learning problems.

Yesterday, we didn’t reach a conclusion on what was the right thing to do. I was tending toward choose problems based on interest. Assuming you have some sense of what has been going on in the community, your view on what is interesting and what isn’t should at least somewhat match the view of the community. If you feel a problem is interesting, my bets are that many other people feel that way, and for this reason it is a reasonable problem to work on.

I chatted with person B again today about this subject, and he brought up two interesting points. First, it is unlikely that you will get funding to work on problems that are not practically motivated. While this is probably true later in life, graduate school is exactly for this purpose. You get paid to work on the things that you find interesting, it doesn’t matter whether they’re well-motivated or not. Even later on, I think there is an art to making your work sound relevant and convincing people that it is worth their funding, and even with grants, I believe you have some flexibility to work on problems that interest you.

His second point was that working on obscure problems that no one cares about is not productive, even if you find it personally interesting. For example, if you take a well-known problem, modify it slightly, and rederive results for the modification, it isn’t interesting unless the modification you made is well motivated. I agreed with him on this point; I don’t really consider it research to walk through calculations for subtly different problems. There has to be something novel to the modified problem. On the other hand, if you took an obscure problem, and came up with some elegant, novel solution, that would be interesting and useful. Additionally, if you worked on an obscure problem, but came up with intermediary results that could be applicable elsewhere (for example new concentration inequalities), I would also say that is useful.

So after thinking about this for a couple of days here is where I stand. A research problem is worth working on (read: I would work on a problem) if: 1) you find it interesting, 2) it is practically motivated, 3) you believe that other researchers will be interested in your results (whether or not is it practically motivated), 4) You believe there are intermediary consequences, lemmas, or results that will be widely useful, or 5) You believe there is something elegant about the problem and its solution (i.e. the solution is not mechanical). To me (1) is necessary, and at least one of two through five are required.

Doing Research Effectively

Friday, August 12th, 2011

I just submitted a research paper to a conference (INFOCOM) a couple of weeks ago and have been recently trying to figure out what I want take on as my next research project. It’s been a pretty rough couple of weeks, where I wasn’t really sure what I should be doing and I realized that this phase is what makes doing research challenging. In this post I wanted to talk a bit about my experiences over the last two weeks and some of my other thoughts on how to do research effectively. Since I work in the more theoretical side of machine learning, I realize that most of my accounts will be tailored to research in this field, and after thinking about this a bit more, some parts of this certainly won’t apply to other research areas (even within computer science).

Over the course of my last project, we came up with a bunch of ideas for small projects that we could work on after the submission, and the first place I went for inspiration was these projects. Unfortunately, many of these ideas were incremental changes to my existing project, i.e. ways to relax certain assumptions, ways to make the guarantees slightly stronger, etc. and I didn’t think they would make for substantial projects. I also had bunch of ideas that I had come up with when thinking about class project ideas for both of the machine learning classes I took last year. Some of these were terrible, some were interesting, but I didn’t really find anything that I was really excited about working on. I will emphasize here that it is a really good idea to write all of these things down when they come up, so that you don’t forget about things you want to do later on. The fact that I wrote down almost all of the ideas I had come up with over the last year was really helpful in finding a new project.

I’ve been working on problems related to network tomography, and to look for some inspiration I read a bunch of the newer papers in that field, along with some papers discussing problems in statistical learning, which is another area that I find really interesting. One of the ideas I had on my list was related to sparse coding, which is a pretty new idea to find simple representations of signals. I started reading a bunch of papers on sparse coding and kept my idea in the back of my head while reading, and eventually I figured out that my idea, as I had originally framed it months earlier, would not work. This was discouraging, but with some more thinking I found that a related formulation looked more promising. By promising I mean that I thought it would be interesting, and that I believe it solves an open problem. I decided that I might want to start working on this idea.

Before diving into the theoretical side of any problem, I like to run some simple simulations to see if the idea makes sense. In this context, I found an implementation of a related algorithm and modified it to see if my idea would work on some fairly trivial inputs. I like to do this as a sanity check, so I don’t waste my time proving things about algorithms that don’t work in practice. This I think is quite unique to machine learning, where the theory is mostly motivated by practice, meaning that no one will really care about how great your guarantees are if your method does not perform well at least on simulated data. At any rate, I didn’t want to start thinking about the guarantees of my algorithm without making sure that the it would work, and thats why I ran some simply simulations.

This is more or less where I am right now with my project, and the next step is the hard part: a theoretical analysis of the method. I don’t really have any clue about how to do this and honestly I find that I make progress in this direction at the most random times, such as when I’m in the shower. This reminds me of something that Manuel Blum told my algorithms class that I think is really interesting. He said, “I’m interested in where ideas come from?” I, for one, am also interested in this, but have no idea about the answer.

I think I kind of lost track of what I wanted to say in this article so I will summarize, and possibly add some new stuff here. When I’m evaluating whether a research idea is worthwhile or not I tend to ask the following questions (in order).

  1. Is it well-motivated? Are there real-world applications where you would want to do this?
  2. Is it novel? Has anyone else done it or related things before? (literature search)
  3. Is it interesting to me?
  4. Does it work, at least on simple examples? (simulations)
  5. Can you prove that it works? What are the assumptions/restrictions and what guarantees can you make? How do these compare to related work?

Usually, if I can answer affirmatively to the first 4 questions, then I am willing to spend some time and figure out answers to the fifth question. I went through these steps for most of the ideas I had written down over the course of the last year. Many of them didn’t make it passed the first question, some involved a little more thought and a perusing through related work. This last idea so far has made it passed the first four barriers; the next step is the analysis.

Capsules: Designing Web Applications For Review

Sunday, May 2nd, 2010

I just came back from WWW 2010 where I presented a research project I’ve been working on for the majority of my undergraduate years. The project is about building web applications with high level security properties that can be verified in a code review. This post is about the project and why I think it’s really cool.

I’d like to start off by convincing you that we need web applications with verifiable high-level security properties. First, what is a high level security property? It’s an application-specific guarantee about the privacy and integrity of user data maintained by the application. For example, as a user of an online banking application, I’d like some assurance that I cannot lose money unless I have authorized a transaction. I’d also like some guarantee that only I can view my account balances and transaction histories. These are high-level properties regarding the integrity and privacy of my data. And these are the kinds of guarantees that we want with our web applications; it’s not enough to just defend an app from XSS and CSRF.

Since these properties can be violated by the application itself (as opposed to external attackers), we have to make sure that the application does not violate them. This requires a code review. Unfortunately, with state-of-the-art technologies, and with sophisticated, complex applications, these code review are incredibly challenging. I would argue that they are infeasible. Why? Because verifying that a high-level property is achieved involves an exhaustive review of the application. With current application architectures, every object has the privileges necessary to violate whatever security property we’re interested in. In order guarantee that the application satisfies a property, we therefore have to make sure that the entire application does not violate it. Since applications are enormous, this is very challenging.

My project looked at making these code review easier by partitioning an application into components and granting only specific privileges to each component. Partitioning an application and exposing limited privileges facilitates a code review because now only parts of the application have the privileges needed to violate any particular security property. Auditors may not even have to look at certain application modules because they can guarantee a priori that those modules cannot violate the properties we are interested in.

All we’re doing here is bringing the idea of least privilege to web applications. We used an object capability approach to achieving least privilege in application components. Our goal was to confine each application component to a reduced-privilege context. We took a multi-faceted approach. First, we prevented application components from constructing additional privilege. We did this by requiring that applications are written in an object-capability language (in our case Joe-E). Second, we prevented the application from maintaining state outside of a semi-persistent session object (By semi-persistent, I mean that it lives in memory but is maintained across multiple HTTP requests). Combined, these two properties imply that all privileges to user data and resources must reside within the session object. Finally, we use wrapper objects to expose only a subset of the session object to each application component. This effectively confines each application component into a reduced-privilege context.

In terms of implementation, we built Capsules, a prototype framework that extends the Java Servlet Framework with these ideas. As mentioned, we require that applications are written in Joe-E, an object capability subset of Java. We use several Joe-E features to achieve the three aspects of our approach. First, Joe-E prevents objects from constructing privileges from scratch. Secondly, Joe-E allows us to declare application components (called Servlets) as immutable, which, in short, means that they cannot maintain state. Finally, Joe-E allows us to construct wrapper objects that actually encapsulate their internal state, so that Servlets must go through the interfaces exposed by the wrapper rather than using reflection to obtain a reference to the underlying session object. In this way, Joe-E helps us establish these reduced-privilege contexts.

We also conducted an evaluation of this framework by building a simple web mail application and verifying that the application maintains the privacy and integrity of user mailboxes. In this analysis, we discovered that there were several application components that we could completely ignore, simply because they had no way to violate the privacy and integrity properties. While our application was simple, we believe that this kind of analysis will also apply to more sophisticated applications, making it more practical to review these kinds of high-level properties.

So that’s a overview of the Capsules project. I’ve ommited most of the technical details so that I could concisely convey the main points. If you are interested, I encourage you to read our paper or see the slides for my talk (although I don’t think the slides will be very helpful apart from the pretty pictures). Finally, please feel free to contact me if you have questions or are interested in talking to me about the project.

NP-Completeness in the workplace

Tuesday, July 7th, 2009

My algorithms professor used to tell his students (including me) this story to motivate studying NP-complete problems and reductions. The story goes something like this: say you’re working as a software developer and your boss gives you this project where you need to come up with a efficient way to do some computation. You go to your desk and think about for awhile and then suspect that there is no efficient way to do the computation. So you can’t just go back to your boss and say, “I think this problem is NP-Hard, so I give up,” you need to show your boss that it’s NP-Hard and this motivates the studying of reductions. It further motivates the study of approximation algorithms and other techniques to cope with NP-Completeness.

I’ve heard this story a couple of times from my professor, but I never really felt like this is something that would happen. How often is it that you are given a problem that doesn’t look intractable immediately, and how much convincing will your boss really need? For that matter how often, in a typical job, do you encounter intractable problems? Like how often does this really happen at a typically development job? Now I haven’t spent too much time in industry and really in my time I never ran into NP-Hard problems so I don’t really know. Although the story was motivational for an algorithms course, the situation used to seem implausible to me. That is until I was challenged to do this myself

This summer, I’m working in a computational genomics lab refining some algorithms for analyzing gene expression data. The core problem we are working on is NP-Hard, but in my refinements, I ran into another NP-Hard problem that I did not immediately see as intractable. I guess this story may be a little bit skewed because I am basically doing algorithms research but I still thought it was pretty cool.

My work deals with finding connected components in a large gene network (an undirected graph). To simplify things, we have a scoring function and want to find connected components with good scores and because of the scoring function this turns out to be NP-Hard, read this for the exact problem and why it’s NP-Hard (The exact problem is a generalization of set cover). Anyway, the basic way we do this is to pick a vertex with a “good” score (called the “seed”) and then expand the module from that node.

In our experimentation, we decided to see what would happen if we allowed seeds to be multiple disconnected nodes and grew our modules from these multi-node seeds. Basically, here you choose k nodes to be your seed and you grow the module out from these seeds, but it is not a requirement that the result is a connected component, in this case it is ok if the resulting module consists of k connected components. The motivation was that from a biological perspective, a disease may affect two or more gene pathways and we wanted our algorithms to be able to account for that, hence allowing multiple connected components.

I set out to implement this feature and was thinking about ways to create these multi-node seeds. I started out by filtering away all of the “bad” vertices and was left with a set of 30-50 good vertices but it was (and still is) not clear what’s a good way to combine these single seeds into multi-node seeds.

My first approach was to simply enumerate all of the combinations and then choose the ones where the individual vertices had good scores. This approach besides having exponential time complexity was bad in that the seeds we selected were all very similar. This was because the individual nodes that were very good were in all of the multi-node seeds we selected. Thus this approach was not satisfactory. Another problem with this approach was that we typically did not see disjoint connected components in the resulting solutions.

To remedy the last problem, we decided on a new approach where we select multi-node seeds where each individual vertex is far from the rest. We compute this by looking at the neighborhoods (typically only 1 level of breadth first search) around each seed and computing the intersection of these neighborhoods. So this seemed like a good approach and we were hopeful this would produce disjoint connected components in the results so I set out to implement this.

Lets restate the problem: Given a set of vertices in a graph, where each vertex has an associated set of elements from some universe U. The goal is to find k vertices that minimizes the intersection of the sets, or such that the intersection of these sets is the empty set.

I thought about this problem for awhile and was quite confident that it was intractable. Before talking to the grad student I’m working with about it, I wanted to have at least a sketch of a reduction in mind. My first reaction was that I should be able to reduce set cover to this problem, possibly by “negating” all of the sets or something. I messed with this for awhile and I didn’t really get anywhere.

I mentioned this problem to some other people in my lab and we all thought about it for awhile. Eventually one guy found that he could reduce independent set to this problem quite easily. If you instead go back to the gene network graph and consider each of the “elements of the universe” as neighbors of the vertices in the graph, then you immediately see that this is the independent set problem. If you treat each neighborhood as a node in a graph, and there is a edge between two neighborhoods if they share an element in common, then you can solve this problem by running an independent set solver on this graph. Conversely, given a graph, you can define a universe of elements and a set to each vertex such that vertices that are not connected share no elements and vertices that are connected share some elements. Then you can look for multi-node seeds and solve independent set. The way I explained the problem above, and the way that I thought about it made it difficult to see that it was independent set, but if you go back to the original problem, it’s easier to see.

Now that I new the problem was NP-Hard, I started thinking about ways to get good enough multi-node seeds and eventually settled on a greedy algorithm that grows the seed sets by adding the node with minimal intersection. The resulting seeds are decent and we did start seeing solutions with multiple disjoint components which is exactly what we were looking for. Despite being faced with a “challenging” problem, we did end up finding a good working solution that gave us the results that we wanted.

This little anecdote, although quite silly in retrospect (because the reduction was pretty straightforward), really showed me that what I learned about reductions and NP-Completeness is important from a practical perspective. It also showed me that my professors story was actually quite plausible. Plus it was another situation where I got to apply things that I learned, and I always find that really cool.