Show Me the Data
April 27, 2010
One of my friends recently pointed me to this post about network data. The author states that one of the things he will miss the most about working at Google is the access to the tremendous amount of data that the company collects.
Although I have not worked at Google and can only imagine the treasure trove its employees enjoy, I have also spent time with lots of sensitive data, during my time at AT&T Research Labs. At AT&T we had (and researchers presumably still have) access to a fount of data, ranging from router configurations to routing table dumps to traffic statistics of all kinds. Direct access to this kind of data was tremendously valuable: it allowed me to "get my hands dirty" and play with the data as I explored interesting questions that might be hiding in it. During that summer, I developed a taste for working on real, operational problems.
Unfortunately, when one retreats to the ivory tower, one cannot bring the data along for the ride. Back at my desk at MIT, I realized there were a lot of problems with network configuration management, and I wanted to build tools to help network operators run their networks better. One of these tools was the "router configuration checker" (rcc), which hundreds of ISPs have downloaded and used to check their routing configurations for various kinds of errors. The road to developing this tool was tricky: it required knowing a lot about how network operators configure their networks and, more importantly, direct access to network configurations on which to debug the tool. I found myself in a catch-22: I wanted to develop a tool that was useful for operators, but I needed operators to give me data to develop the tool in the first place.
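To make the flavor of such a tool concrete, here is a toy sketch of the kind of static check a configuration checker might perform. The real rcc implements far more sophisticated analyses of BGP configuration; the config format, check, and names below are purely illustrative.

```python
import re
from collections import defaultdict

# Two toy Cisco-style config fragments (illustrative only, not real rcc input).
CONFIGS = {
    "router1": """
interface Loopback0
 ip address 10.0.0.1 255.255.255.255
interface GigabitEthernet0/0
 ip address 192.168.1.1 255.255.255.252
""",
    "router2": """
interface Loopback0
 ip address 10.0.0.1 255.255.255.255
interface GigabitEthernet0/0
 ip address 192.168.1.2 255.255.255.252
""",
}

def find_duplicate_addresses(configs):
    """Flag IP addresses assigned on more than one router: a common
    copy-paste misconfiguration that a static checker can catch."""
    seen = defaultdict(list)
    for router, text in configs.items():
        for ip in re.findall(r"ip address (\S+) \S+", text):
            seen[ip].append(router)
    return {ip: routers for ip, routers in seen.items() if len(routers) > 1}

print(find_duplicate_addresses(CONFIGS))
# flags 10.0.0.1, which is configured on both router1 and router2
```

Even a check this simple only becomes trustworthy after it has been run against many real configurations, which is exactly the chicken-and-egg problem described above.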
My most helpful mentor at this juncture was Randy Bush, a research-friendly operator, who told me something along the following lines: "Everyone wants data, but nobody knows what they're going to do with it once they get it. Help the operators solve a useful problem, and they will give you data."
This advice could not have been more sage.
I went to meetings of the North American Network Operators Group (NANOG) and talked about the basic checks I had managed to bootstrap into some scripts using data I had from MIT and a couple of other smaller networks (basically, enough to test that the tool worked on Cisco and Juniper configurations). At NANOG, I met a lot of operators who seemed interested in the tool and were willing to help. Often they would not provide me with their configurations, but they would run the tool for me and tell me the output (and whether or not the output made sense). Guy Tal was another person to whom I owe a great deal of gratitude for his patience in this regard. Sometimes, I got lucky and even got hold of some configurations to stare at.
Before I knew it, I had a tool that could run on large Internet Service Provider (ISP) configurations and give operators meaningful information about their networks, and hundreds of ISPs were using it. When I gave my job talk, people from other areas may not have understood the details of "BGP", "route oscillations", or "route hijacks", but they certainly understood that ISPs were actually using the tool.
We applied the same approach when we started working on spam filtering. We wrote an initial paper that studied the network-level behavior of spammers with some data we were able to collect at a local “spam trap” on the MIT campus (more on that project in a later post). The visibility of that work (and its unique approach, which spawned a lot of follow-on work) allowed us to connect with people in industry who were working on spam filtering, had real problems that needed solving, and had data (and, equally importantly, expertise) to help us think about the problems and solutions more clearly.
In these projects (as well as other more recent ones), I see a pattern in how one can get access to “real data”, even in academia. Roughly, here is some advice:
- Have a clear, practical problem or question in mind. Do not simply ask for data. Everyone asks for data. A much more select set is actually capable of doing something useful with it. Demonstrate that you have given some thought to questions you want to answer, and think about whether anyone else might be interested in those questions. Importantly, think about whether the person you are asking for data might be interested in what you have to offer.
- Be prepared to work with imperfect data. You may not get exactly the data you would like. For example, the router configurations or traffic traces might be partially anonymized. You may only get metadata about email messages, as opposed to full payloads. (And so on.) Your initial reaction might be to think that all is lost without the “perfect dataset”. This is rarely the case! Think about how you can either adjust your model, or adapt your approach (or even the question itself) with imperfect data.
- Be prepared to operate blindly. In many cases, operators (or other researchers) cannot give you raw data that they have access to; often, data may be sensitive, or protected by non-disclosure agreements. However, these people can sometimes run analysis on the data for you, if you are nice to them, and if you write the analysis code in a way that they can easily run your scripts.
- Bring something to the table. This goes back to Randy Bush’s point. If you make yourself useful to operators (or others with data), they will want to work with you—if you are asking an interesting question or providing something useful, they might be just as interested in the answers as you are.
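The "operate blindly" advice above has a practical consequence for how you package your analysis: if the operator must run your code for you, it should be a self-contained script that consumes their sensitive data locally and emits only aggregates they are comfortable sharing back. Here is a minimal sketch of that pattern; the log format and field choices are hypothetical, not from any real operator.

```python
# Hypothetical sketch: an analysis script designed to be run *by* the
# operator on sensitive data. Only aggregate tallies are printed, so the
# raw log never has to leave the operator's machine.
import sys
from collections import Counter

def summarize(lines):
    """Count occurrences of each (hypothetical) BGP session state,
    one state per line; return only the aggregate tallies."""
    states = Counter(line.strip() for line in lines if line.strip())
    return dict(states)

if __name__ == "__main__":
    # Usage (run by the operator): python summarize.py < sessions.log
    print(summarize(sys.stdin))
```

Keeping the script dependency-free and single-file matters here: a busy operator is far more likely to pipe a log through one stdlib-only script than to install a research prototype.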
There is much more to say about networking research and data. Sometimes it is simply not possible to get the data one needs to solve interesting research problems (e.g., pricing data is very difficult to obtain). Still, I think that as networking researchers we should first look for interesting problems and then look for data that can help us solve those problems; too often, we operate in reverse, like the drunk who looks for his keys under the lamppost because that is where the light is. I'll say more about this in a later post.
I really like this post, great advice… although I work in a very different area, my own experiences in industrial research and working with people on the front lines in telecoms echo this. They're happy to participate (they generated data for me), but they want to know what they will get in return. And of course, if it's a tool, I'm guessing that another crucial feature of working in this mode is iteration, as access to increasingly large/complex data sets continues to define and redefine the nature of the actual problem being solved.