The Visible Effects and Hidden Sources of Internet Latency

Most Internet Service Providers advertise their performance in terms of downstream throughput.  The “speed” that one pays for reflects, effectively, the number of bits per second that can be delivered on the access link into your home network.  Although this metric makes sense for many applications, it is only one characteristic of network performance that ultimately affects a user’s experience.  In many cases, latency can be at least as important as downstream throughput.

For example, consider the figure below, which shows Web page load time as downstream throughput increases: the time to load many Web pages decreases as throughput increases, but beyond roughly 16 Mbps, additional downstream throughput has little further effect on Web page load time.

web-plt

Page load times decrease with downstream throughput, but only up to 8–16 Mbits/s.

The culprit is latency: for short, small transfers (as is the case for many Web objects), the time to initiate a TCP connection and open the initial congestion window is dominated by the round-trip time between the client and the Web server.  In other words, the capacity of the access link no longer matters, because TCP cannot increase its sending rate enough to “fill the pipe” before the transfer completes.

The role of latency in Web performance is no secret to anyone who has spent time studying it, and many content providers, including Google and Facebook, have spent considerable effort to reduce latency (Google has a project called “Make the Web Faster” that encompasses many of these efforts).  Latency affects the time it takes to complete a DNS lookup, the time to initiate a connection to the server, and the time to increase TCP’s congestion window (indeed, students of networking will remember that TCP throughput is inversely proportional to the round-trip time between the client and the server).  Thus, as throughput continues to increase, network latency plays an increasingly dominant role in the performance of applications such as the Web.  Of course, latency also determines the user experience of many latency-sensitive applications, including voice, streaming audio and video, and gaming.
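
To see why the round-trip time dominates small transfers, here is a rough back-of-the-envelope model (purely illustrative, not our measurement code) that charges one round trip for the DNS lookup, one for the TCP handshake, and one per slow-start round.  The object size, initial window, and link speeds below are assumptions chosen only to make the point: past a modest throughput, only a lower RTT shortens the fetch.

```python
# A back-of-the-envelope model (not measurement code) of the time to fetch one
# small Web object over a fresh TCP connection.  All parameter values here are
# illustrative assumptions.

def fetch_time_ms(object_kb, rtt_ms, link_mbps, init_cwnd_segments=10, mss_bytes=1460):
    """Rough fetch time in ms: one RTT for DNS, one for the TCP handshake,
    plus one RTT per slow-start round until the object is delivered."""
    dns = rtt_ms                          # DNS lookup: one round trip
    handshake = rtt_ms                    # TCP three-way handshake
    bytes_left = object_kb * 1024
    cwnd_bytes = init_cwnd_segments * mss_bytes
    transfer = 0.0
    while bytes_left > 0:
        # Each round delivers at most one congestion window, and is also
        # limited by what the access link can carry in one RTT.
        link_bytes_per_rtt = link_mbps * 1e6 / 8 * (rtt_ms / 1000.0)
        bytes_left -= min(bytes_left, cwnd_bytes, link_bytes_per_rtt)
        transfer += rtt_ms
        cwnd_bytes *= 2                   # slow start doubles the window each RTT
    return dns + handshake + transfer

# A 100 KB object at a 50 ms RTT takes about as long at 16 Mbps as at 100 Mbps,
# while halving the RTT roughly halves the fetch time.
for mbps in (4, 16, 100):
    print(f"{mbps:3d} Mbps, 50 ms RTT: {fetch_time_ms(100, 50, mbps):.0f} ms")
print(f" 16 Mbps, 25 ms RTT: {fetch_time_ms(100, 25, 16):.0f} ms")
```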

The question, then, becomes how to reduce latency to the destinations that users commonly access.  Content providers such as Google have taken two main approaches: (1) placing Web caches closer to users, and (2) adjusting TCP’s congestion control to send at a faster rate during the first few round trips of a connection.  These steps, however, are only part of the story, because the network performance between the Web cache and the user may still suffer, for a variety of reasons:

  • First, factors such as bufferbloat and DSL interleaving can introduce significant latency in the last mile.  Our study from SIGCOMM 2011 showed how both access link configuration and a user’s choice of equipment (e.g., DSL modem) can significantly affect the latency that a user sees.

  • Second, a poor wireless network in the home can introduce significant latency effects; sometimes we see that 20% of the latency for real user connections from homes is within the home itself.

  • Finally, if the Web cache is not close to users in the first place, the paths between users and their destinations can still be subject to significant latency.  This problem is particularly evident in developing countries, where poor peering and interconnection can result in long paths to content, and where the vast majority of users access the network through mobile and cellular networks.

In the Last Mile

In our SIGCOMM 2011 paper “Broadband Internet Performance: A View from the Gateway” (led by Srikanth Sundaresan and Walter de Donato), we pointed out several aspects of home networks that can contribute significantly to latency.  We define a metric called last-mile latency, which is the latency to the first hop inside the ISP’s network. This metric captures the latency of the access link.
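
For readers who want to approximate this metric from an ordinary host, a minimal sketch follows; it assumes a Unix-like machine with the standard traceroute tool and assumes the home gateway is hop 1, so that hop 2 is the ISP’s first router.  BISmark gateways run their own measurement tools; this is not that code, and the second-hop assumption does not hold in every home network.

```python
# Minimal sketch: estimate last-mile latency as the minimum RTT to the second
# traceroute hop, treating hop 1 as the home gateway and hop 2 as the first
# router inside the ISP.  Both the tool choice and the hop-2 assumption are
# simplifications, not the BISmark measurement code.
import re
import subprocess

def last_mile_latency_ms(target="8.8.8.8", probes=5):
    out = subprocess.run(
        ["traceroute", "-n", "-q", str(probes), "-m", "2", target],
        capture_output=True, text=True, check=True).stdout
    hop2_line = out.strip().splitlines()[-1]              # the hop-2 line
    rtts = [float(x) for x in re.findall(r"([\d.]+) ms", hop2_line)]
    return min(rtts) if rtts else None                    # min filters queueing noise

if __name__ == "__main__":
    print("estimated last-mile latency (ms):", last_mile_latency_ms())
```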

We found in this study that last-mile latencies are often quite high, varying from about 10 ms to nearly 40 ms (and accounting for 40–80% of the end-to-end path latency). Variance is also high. One might expect that variance would be lower for DSL, since it is not a shared medium like cable. Surprisingly, we found that the opposite was true: most users of cable ISPs have last-mile latencies of 0–10 ms, while a significant proportion of DSL users have baseline last-mile latencies of more than 20 ms, with some users seeing last-mile latencies as high as 50 to 60 ms. Based on discussions with network operators, we believe DSL companies may be enabling an interleaved local loop for these users.  ISPs enable interleaving for three main reasons: (1) the user is far from the DSLAM; (2) the user has a poor-quality link to the DSLAM; or (3) the user subscribes to “triple play” services. An interleaved last-mile data path increases robustness to line noise at the cost of higher latency, typically two to four times the baseline latency. Thus, cable providers in general have lower last-mile latency and jitter, while latencies for DSL users may vary significantly based on physical factors such as distance to the DSLAM or line quality.

dsl-latencies

Most users see last-mile latencies of less than 10 ms, but a significant number of users see last-mile latencies greater than 10 ms.

Customer-provided equipment also plays a role.  Our study confirmed that excessive buffering is a widespread problem afflicting most ISPs (and the equipment they provide). We profiled different modems to study how the problem affects each of them, and we also observed the possible effects of ISP policies, such as active queue and buffer management, on latency and loss.  For example, when measuring latency under load (the latency that a user experiences when the access link is saturated by an upload or a download), we see more than an order of magnitude of difference between modems. The 2Wire modem we tested had the lowest worst-case last-mile latency, 800 ms. Motorola’s was about 1.6 seconds, and the Westell modem we tested had a worst-case latency of more than 10 seconds.
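
The general shape of such a “latency under load” test is sketched below: record the RTT to the gateway while the link is idle, then again while an upload saturates it.  The gateway address and iperf3 server are placeholders you would replace with your own; this illustrates the method rather than the tooling used in the study.

```python
# Sketch of a latency-under-load test: compare RTTs to the home gateway while
# the uplink is idle versus while an upload saturates it.  GATEWAY and
# IPERF_SERVER are placeholders; you need your own iperf3 server for this.
import re
import subprocess

GATEWAY = "192.168.1.1"              # placeholder: your home gateway
IPERF_SERVER = "iperf.example.net"   # placeholder: an iperf3 server you control

def ping_rtts_ms(host, count=20):
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    return [float(m) for m in re.findall(r"time=([\d.]+) ms", out)]

idle = ping_rtts_ms(GATEWAY)
# Saturate the upstream link for ~30 seconds and ping concurrently.
upload = subprocess.Popen(["iperf3", "-c", IPERF_SERVER, "-t", "30"],
                          stdout=subprocess.DEVNULL)
loaded = ping_rtts_ms(GATEWAY)
upload.wait()

print("idle RTT (ms):   median", sorted(idle)[len(idle) // 2], "max", max(idle))
print("loaded RTT (ms): median", sorted(loaded)[len(loaded) // 2], "max", max(loaded))
```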

modem-bufferbloat

Empirical measurements of modem buffering. Different modems have different buffer sizes, leading to wide disparities in observed latencies when the upstream link is busy.

Last-mile latency can also be high for particular technologies such as mobile.  In a recent study of fixed and mobile broadband performance in South Africa, we found that, although the mobile providers consistently offer higher throughput, the latency of mobile connections is often 2–3x higher than that of fixed-line connectivity in the country.

In the Home Wireless Network

Our recent study of home network performance (led by Srikanth Sundaresan) found that a home wireless network can also be a significant source of latency.  We have recently instrumented home networks with a passive monitoring tool that determines whether the access link or the home wireless network (or both) are potential sources of performance problems.  One of the features that we explored in that work was the TCP round-trip time between wireless clients in the home network and the wireless access point in the home.  In many cases, due to wireless contention or other sources of wireless bottlenecks, the TCP round-trip latency in home wireless networks was a significant portion of the overall round-trip latency.

We analyzed the performance of the home network relative to the wide-area network performance for real user traffic in about 65 homes over the course of one month. We use these traces to compare the round-trip times between the devices and the access point to the round-trip times from the access point to the wide-area destination for each flow. We define the median latency ratio for a device as the median ratio of the LAN TCP round-trip time to the WAN TCP round-trip time across all flows for that device. The figure below shows the distribution of the median latency ratio across all devices. The result shows that, for 30% of the devices in those homes, at least half of the flows have end-to-end latencies in which the home wireless network contributes more than 20% of the overall latency.  This technical report provides more details concerning the significant role that home wireless networks can play in end-user performance; a future post will explore this topic at length.
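
A minimal sketch of the median latency ratio computation is below.  The flow records are made-up placeholders; in the study, the LAN and WAN TCP round-trip times came from passive traces captured at the access point.

```python
# Sketch of the "median latency ratio" metric: for each device, the median over
# its flows of (LAN TCP RTT / WAN TCP RTT).  The flow tuples below are made-up
# placeholders standing in for values extracted from passive traces.
from collections import defaultdict
from statistics import median

# (device_id, lan_rtt_ms, wan_rtt_ms), one tuple per TCP flow
flows = [
    ("laptop-1", 4.0, 40.0),
    ("laptop-1", 35.0, 60.0),
    ("phone-2", 2.0, 80.0),
    ("phone-2", 1.5, 30.0),
]

ratios_by_device = defaultdict(list)
for device, lan_rtt, wan_rtt in flows:
    ratios_by_device[device].append(lan_rtt / wan_rtt)

for device, ratios in ratios_by_device.items():
    # A median ratio of 0.25 means that, for half the flows, the wireless hop
    # contributes at least 20% of the end-to-end (LAN + WAN) latency.
    print(device, "median latency ratio:", round(median(ratios), 2))
```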

lan-rtts

The distribution of the median ratio of the LAN TCP round-trip time to the WAN TCP round-trip time across all flows for that device, across all devices.

Our findings about latency in home networks suggest that the RTT introduced by the wireless network may often be a significant fraction of the end-to-end RTT. This finding is particularly meaningful in light of the many recent efforts by service providers to reduce end-to-end latency through myriad optimizations and careful placement of content. We recommend that, in addition to the attention that is already being paid to optimizing wide-area performance and host TCP connection settings, operators also spend effort improving home wireless network performance.

In Developing Regions

Placing content in a Web cache has little effect if the users accessing the content still have high latency to those destinations.  A study of latency from fixed-line access networks in South Africa using BISmark data, led by Marshini Chetty, Srikanth Sundaresan, Sachit Muckaden, and Enrico Calandro in cooperation with Research ICT Africa, showed that peering and interconnectivity within the country still has a long way to go.  In particular, the plot below shows the average latency from 16 users of fixed-line access networks in South Africa to various Internet destinations.  The bars are sorted in order of increasing distance from Johannesburg, South Africa.  Notably, geographic distance from South Africa does not correlate with latency: the latency to Nairobi, Kenya, is almost twice the latency to London.  In our study, we found that users in South Africa experienced average round-trip latencies exceeding 200 ms to five of the ten most popular websites in South Africa: Facebook (246 ms), Yahoo (265 ms), LinkedIn (305 ms), Wikipedia (265 ms), and Amazon (236 ms). Many of these sites have data centers only in Europe and North America.
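
For reference, the sketch below shows the gist of how the figure’s ordering is derived: compute the great-circle (haversine) distance from Johannesburg to each measurement destination and sort by it, alongside the mean measured latency.  The coordinates are real; the latency values are placeholders rather than the study’s measurements.

```python
# Sketch: sort measurement destinations by great-circle distance from
# Johannesburg and print the mean latency to each.  The latency values are
# placeholders, not the measured numbers behind the figure.
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points, in kilometers."""
    dlat, dlon = radians(lat2 - lat1), radians(lon2 - lon1)
    a = sin(dlat / 2) ** 2 + cos(radians(lat1)) * cos(radians(lat2)) * sin(dlon / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

JNB = (-26.20, 28.05)  # Johannesburg

# destination -> ((lat, lon), mean RTT in ms -- placeholder values)
destinations = {
    "Nairobi":   ((-1.29, 36.82), 280.0),
    "London":    ((51.51, -0.13), 160.0),
    "Amsterdam": ((52.37, 4.90), 170.0),
}

for name, ((lat, lon), rtt) in sorted(destinations.items(),
                                      key=lambda kv: haversine_km(*JNB, *kv[1][0])):
    print(f"{name:10s} {haversine_km(*JNB, lat, lon):6.0f} km  {rtt:5.1f} ms")
```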

jnb-latencies

The average latencies to Measurement Lab servers around the world from South Africa. The numbers below each location reflect the distance from Johannesburg in kilometers, and the bars are sorted in order of increasing distance from Johannesburg.  Notably, latency does not increase monotonically with distance.

People familiar with Internet connectivity may not find this result surprising: indeed, many ISPs in South Africa connect to one another via the London Internet Exchange (LINX) or the Amsterdam Internet Exchange (AMS-IX) because it is cheaper to backhaul connectivity to exchange points in Europe than it is to connect directly at an exchange point on the African continent.  The reasons for this behavior appear to be both regulatory and economic, but more work is needed, both in deploying caches and in improving Internet interconnectivity, to reduce the latency that users in developing regions see to popular Internet content.


The Resilience of Internet Connectivity in East and South Africa: A Case Study

On March 27, 2013 at 6:20 a.m. UTC, the SeaMeWe-4 (SMW4) cable outage affected connectivity across the world.  SeaMeWe-4 is currently the largest submarine cable connecting Europe and Asia.  The Renesys blog recently covered the effect of this outage on various parts of Asia and Africa (Pakistan, Saudi Arabia, the UAE, etc.).  In this post, we explore how the fiber cut affected connectivity from other parts of the world, as visible from the BISmark home router deployment.  The credit for the data analysis in this blog post goes to Srikanth Sundaresan, one of Georgia Tech’s star Ph.D. students, whose work on BISmark has garnered a number of awards.

Background: BISmark

The BISmark project has been deploying customized home gateways in home broadband access networks around the world for more than two years; we currently have more than 130 active home routers measuring the performance of access links in nearly 30 countries.  The high-level goal of the project is to gather information from inside home networks to help users and ISPs better debug their home networks.  Two years ago, we published the first paper using BISmark data in SIGCOMM.  The paper explores the performance of broadband access networks around the United States and has many interesting findings:

  • We showed how a technique called “interleaving” on DSL networks can introduce tens of milliseconds of additional latency on home networks.
  • We explored how a user’s choice of equipment can introduce “bufferbloat” effects on home access links.
  • We showed how technologies such as PowerBoost can also introduce sudden, dramatic increases in latency when interacting with buffering on the access link.

The image below shows the current deployment of BISmark.  We have more than 80 routers in North America, nearly 20 in Southeast Asia, about 15 in the European Union, about 15 in South Africa, and about 10 in East Asia.  You can explore the data from the deployment yourself on the Network Dashboard; all of the active measurements are available for download in raw XML format as they are collected.

The BISmark deployment as of May 28, 2013.

Each BISmark router sits in a home broadband access network.  The routers are NetGear WNDR 3700 and 3800s; we ship routers to anyone who is interested in participating.  As an incentive for participating, you gain access to your own data on the network dashboard.  We are also actively seeking researchers and developers; please contact us below if you are interested, and feel free to check out the project GitHub page.

Every BISmark router measures latency to the Google anycast DNS service and to 10 globally distributed Measurement Lab servers every 10 minutes.  Those servers are located in Atlanta, Los Angeles, Amsterdam, Johannesburg, Nairobi, Tokyo, Sydney, New Delhi, and Rio de Janeiro.
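
In spirit, each router’s measurement loop looks something like the sketch below: ping a fixed list of targets every ten minutes and log the results.  The hostnames are placeholders and the routers use their own measurement binaries rather than the system ping; this only illustrates the cadence and structure.

```python
# Sketch of a periodic latency measurement loop in the spirit of BISmark's
# active measurements.  Target hostnames are placeholders, and the real
# routers use dedicated measurement binaries rather than the system ping.
import re
import subprocess
import time

TARGETS = ["8.8.8.8",              # Google anycast DNS
           "mlab.example.org"]     # placeholder for a Measurement Lab server

def min_rtt_ms(host, count=5):
    out = subprocess.run(["ping", "-c", str(count), host],
                         capture_output=True, text=True).stdout
    rtts = [float(m) for m in re.findall(r"time=([\d.]+) ms", out)]
    return min(rtts) if rtts else None

while True:
    timestamp = int(time.time())
    for host in TARGETS:
        print(timestamp, host, min_rtt_ms(host))   # in practice, append to a log/XML record
    time.sleep(600)                                # repeat every 10 minutes
```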

Effects of the SMW4 Fiber Cut: A Case Study

We first explore the effects of the fiber cut on reachability from the active BISmark routers to each of the Measurement Lab destinations.  At the time of the outage (6:20a UTC), the Measurement Lab server in Nairobi became completely unreachable for more than four hours.  The Nairobi Measurement Lab server is hosted in AS 36914 (KENet, the Kenyan Education Network).

Connectivity was restored at 10:34a UTC.  Interestingly, between 9a and 10a UTC, reachability from many of our other BISmark routers to all of the Measurement Lab destinations was affected.  We have not yet explored which of the BISmark routers experienced these reachability problems, but, as we explore further below, this connectivity blip coincides with some connectivity being restored to Kenya via Safaricom, the backup ISP for the Measurement Lab server hosted in KENet.  It is possible that other convergence events were also occurring at that time.

Reachability from BISmark routers to each of the Measurement Lab servers on March 27, 2013.

Analysis of the BGP routing table information from RouteViews shows that connectivity to AS 36914 was restored at 10:34a UTC. The following figure shows the latencies from all nodes to Nairobi before and after the outage. As soon as connectivity returns, the first set of latencies appears to be roughly the same as before, but latencies almost immediately increase from all of the routers, except for one situated in South Africa in AS 36937 (Neotel).  This result suggests that Neotel may have better connectivity to destinations within Africa than some other ISPs, and that access ISPs who use Neotel for “transit” may see better performance and reliability to destinations within the continent. Because only the SEACOM cable was affected by the cut, not the West African Cable System (WACS) or EASSy cable, Neotel’s access to other fiber paths may have allowed its users to sustain better performance after the fiber cut.
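
For readers who want to reproduce this kind of check, the sketch below scans RouteViews updates (converted with bgpdump -m into its one-line, pipe-separated format) and prints announcements whose AS path originates in AS 36914, which makes the disappearance and reappearance of routes to KENet easy to spot.  The file handling and per-prefix bookkeeping are assumptions; this is not the analysis code behind the figures.

```python
# Sketch: scan bgpdump -m output (piped on stdin) for BGP updates concerning
# prefixes originated by AS 36914 (KENet).
# Usage (one RouteViews update file at a time):
#   bgpdump -m <updates-file.bz2> | python3 this_script.py
import sys
from datetime import datetime, timezone

ORIGIN_AS = "36914"
kenet_prefixes = set()

for line in sys.stdin:
    fields = line.rstrip("\n").split("|")
    if len(fields) < 6 or fields[0] != "BGP4MP":
        continue
    timestamp, msg_type, prefix = int(fields[1]), fields[2], fields[5]
    when = datetime.fromtimestamp(timestamp, tz=timezone.utc).strftime("%H:%M:%S")
    path = fields[6].split() if len(fields) > 6 else []
    if msg_type == "A" and path and path[-1] == ORIGIN_AS:
        kenet_prefixes.add(prefix)
        print(when, "announce", prefix, "path:", " ".join(path))
    elif msg_type == "W" and prefix in kenet_prefixes:
        # Withdrawals carry no AS path, so match against prefixes already
        # seen originated by AS 36914.
        print(when, "withdraw", prefix)
```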

Latencies from BISmark routers in various regions to Nairobi, Kenya (AS 36914, KENet).

This incident—and Neotel’s relative resilience—suggests the importance of exploring the effects of undersea cable connectivity in various countries in Africa and how such connectivity affects resilience.  (In a future post, we will explore the effects of peering and ISP interconnectivity on the performance that users in this part of the world see.)   

Internet Routing to KENet during the Outage

6:20a: The Fiber Cut. The reachability and performance effects caused by the SMW4 fiber cut raise the question of what was happening to routes to Kenya (and, in particular, to KENet) at the time of the outage.  We explore this in further detail below.  The first graph below shows reachability to KENet (AS 36914, the large red dot) at 6:20:50 UTC, around which time the fiber cut occurred.  The second plot shows the routes at 6:23:51 UTC; by 6:27:06 UTC, AS 36914 had become completely unreachable.

Internet reachability to KENet (AS 36914) at 6:20:50a, 6:23:51a, and 6:27:06a UTC.

9:05a: Connectivity is (partially) restored through a backup path. About two-and-a-half hours later, at 9:05:49 UTC, AS 36914 starts to come back online, and connectivity is restored within about one minute, although all Internet paths to this destination go through AS 33771 (Safaricom), which is most likely KENet’s backup (i.e., commercial, and hence more expensive) provider.  This is an interesting example of BGP routing and backup connectivity in action: many ISPs such as KENet have primary and backup Internet providers, and paths go through the backup provider (in this case, Safaricom) only when the primary path fails.

Connectivity to KENet (AS 36914) is restored via the commercial backup provider, Safaricom (AS 33771).  It is interesting to note that although connectivity was restored at 9:06a through this backup path, the server hosted in this network was still unreachable until paths switched back to the primary provider (UbuntuNet) at 10:34a.

Note that although connectivity to KENet was restored through Safaricom at around 9:06a UTC, none of the BISmark routers could reach the Measurement Lab server hosted in KENet through this backup path!  This pathology suggests that the failover did not work as planned, for some reason.  Although this disconnection could result from poor Internet “peering” between Safaricom and the locations of our BISmark routers around the world, it is unlikely that bad peering would affect reachability from all of our vantage points.  Still, it is not clear why the connectivity through Safaricom was not sufficient to restore reachability from at least some of the BISmark nodes.  The connectivity issue we observed could be something mundane (e.g., Safaricom simply blocks ICMP “ping” packets), or it could be something much more profound.

It is also interesting to note that it took more than two-and-a-half hours for Internet routing to restore any connectivity at all!  Usually, we think of Internet routing as being dynamic, automatically reconverging when failures occur to find a new working path (assuming one exists).  While BGP has never been known for being zippy, two-and-a-half hours seems excessive.  It is perhaps more likely that some additional connectivity arrangements were being made behind the scenes; it might even be the case that KENet purchased additional backup connectivity (or made special arrangements) during the several hours when it was offline.

10:35a: Connectivity returns through the primary path.  At around 10:34a UTC, routes to KENet begin reverting to the primary path, as can be seen in the figure below.  By 10:35a UTC, everything is “back to normal” as far as BGP routing is concerned, although, as we saw above, latencies remained high to most destinations for an additional eight hours.  It is unclear what caused latencies to remain high after connectivity was restored, but this offers another important lesson: BGP connectivity does not equate to good performance along those BGP paths.  This underscores the importance of using both BGP routing tables and a globally distributed performance measurement platform like BISmark to understand performance and connectivity issues around the times of outages.

By 10:35a UTC, connectivity is restored through UbuntuNet (AS 36944), KENet’s primary provider.  Once BGP convergence begins, it takes only a little more than a minute for paths to revert to the primary path.

Takeaway Lessons

It’s worthwhile to reflect on some of the lessons from this incident; it teaches us about how Internet routing works (and doesn’t work), about the importance of backup paths, and about the importance of performing joint analysis of both routing information and active performance measurements from a variety of globally distributed locations.  I’ve summarized a few of these below:

  • Peering and interconnectivity in Africa haven’t yet come of age.  It is clear from this incident that certain locations in Africa (although not all) are not particularly resilient to fiber cuts.  The SMW4 fiber cut took KENet completely offline for several hours, and even after connectivity was “restored” several hours later, many locations still could not reach the destination through the backup path.  Certain ISPs in Africa that are better connected (e.g., Neotel, and the Measurement Lab node hosted in TENET in Johannesburg) weathered the fiber cut much better than others, most likely because they have backup connectivity through WACS or EASSy.  In a future post, we will explore performance issues in various parts of Africa that likely result from poor peering.
  • Connectivity does not imply good performance.  Even after connectivity was completely “restored” (at least according to BGP), latencies to Nairobi from most regions remained high for almost another eight hours.  This disparity underscores the importance of not relying solely on BGP routing information to understand the quality of connectivity to and from various Internet destinations.  Global deployments like BISmark are critical for getting a more complete picture of performance.
  • “Dynamic routing” isn’t always dynamic.  The ability of dynamic routing protocols to find a working backup path depends on the existence of those paths in the first place.  The underlying physical connectivity must be there, and the business arrangements (peering) between ISPs must exist to allow those paths to exist (and function) when failures do occur.  Something occurred on March 27, 2013 that exposed a glaring hole in the Internet’s ability to respond dynamically to a failure.  It would be very interesting to learn exactly what happened between 6:20a UTC and 9:05a UTC, what ultimately resulted in connectivity being restored (via Safaricom), and why it took so long.  Perhaps we need more sophisticated “what if” tools that help ISPs better evaluate their readiness for these types of events.

In future posts, we will continue to explore how BISmark can help expose problems that result from disconnections, outages, and other pathologies.  Our ability to perform this type of analysis depends on the continued support of ISPs, users, and the broader community.  We encourage you to contact us using the form below if you are interested in hosting a BISmark router in your access network.  (You can also post public comments at the bottom of the page, below the contact form.)

It’s 10 p.m. Do You Know Where Your Passwords Are?

Have you ever wondered how many sites have your credit card number?  Or, have you ever wondered how many sites have a certain version of your password?  Do you think you might have reused the password from your banking Web site on another site?  What if you decided that you wanted to “clean up” your personal information on some of the sites where you’ve leaked it?  Would you even know where to start?

If you answered “yes” to any of the above questions, then Appu is the tool for you.  Appu is a Chrome extension, developed by my Ph.D. student Yogesh Mundada, that keeps track of what we call your privacy footprint on the Web.  Every time you enter personally identifiable information (address, credit card information, password, etc.) into a Web site, Appu computes a cryptographic hash of that information, associates the hash with that site, and stores it, to keep track of where you have entered various information.  If you ever re-enter the same password on a different site, Appu will warn you that you have reused a password and tell you where you have reused it.  As a user, you will immediately see a warning like the one below:

appu-reuse

You might be wondering: “Why should I trust your Chrome extension with the passwords that I enter on various sites?”  The good news is that you do not have to trust Appu with your passwords and personal information to use this tool, because Appu never sends your information anywhere in cleartext.  Before sending a report to us, Appu performs what is called a cryptographic hash on all of your information.  It also only stores a cryptographic hash of each password locally; no passwords are ever stored in cleartext, anywhere.  If you ever enter the same password elsewhere, the result of performing a cryptographic hash on your password would produce the same unreadable output—therefore, Appu never knows what your password is, only that you’ve reused it.  Appu stores your other personal information in cleartext locally on your machine so that you can see which sites have which values of various personal information, but it never sends that information in the clear to us.  Appu always asks the user before sending any information to us, and the tool also gives the user the option to delete anything from the reports that Appu sends to us.  If you still want to assure yourself that Appu is not doing anything suspect, you can read the source code.
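
To make the mechanism concrete, here is a simplified sketch of the reuse check in Python.  Appu itself is a browser extension (written in JavaScript), and this is only an illustration of the idea: keep a salted hash of each password per site, never the cleartext, and warn whenever the same digest shows up on a different site.

```python
# Simplified illustration of a password-reuse check in the spirit of Appu:
# store only a salted hash of each password per site and compare digests.
# This is not Appu's code; the salt handling and storage are placeholders.
import hashlib
import os

SALT = os.urandom(16)        # a real tool would persist the salt locally

def fingerprint(password: str) -> str:
    return hashlib.sha256(SALT + password.encode("utf-8")).hexdigest()

stored = {}                  # site -> password fingerprint (never the cleartext)

def record_login(site: str, password: str) -> None:
    digest = fingerprint(password)
    reused_on = [s for s, d in stored.items() if d == digest and s != site]
    stored[site] = digest
    if reused_on:
        print(f"Warning: the password entered on {site} is also used on: "
              + ", ".join(reused_on))

record_login("bank.example.com", "hunter2")
record_login("travel.example.org", "hunter2")   # triggers the reuse warning
```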

Appu can help users keep track of the following information:

  • Password reuse.  Have I reused the same password across multiple sites?  If so, on which sites have I used the same password?
  • Privacy footprint.  Which sites have a copy of my full name and address (or other information)?  What specific information have I provided to those sites?
  • Password strength.  Have I used a weak password for my online bank account (or another Web site)?
  • Password stagnancy. When was the last time I changed my password on a particular site?

In addition to the pop-up information that users see, as above, Appu also provides reports to allow users to keep track of answers to these questions.  The figures below show two examples of this.  The figure on the left shows the privacy footprint page, where a user can see which sites have stored personal information (e.g., name, email address); the figure on the right shows more detailed information, such as the last time a user changed his or her password.  That report also tells a user how often they’ve visited a site—therefore, Appu can help you figure out that even though you’ve only visited a site once, that site is storing sensitive information, such as your credit card number (hopefully spurring you to go clean up your personal information on that site).

appu-footprint
appu-report

Our hope is that Appu will help users better manage their online privacy footprints, thereby better managing the risks that they potentially expose themselves to through password reuse.

We initially released Appu through a private alpha release to about ten close friends.  Even with this small sample, we can observe interesting aggregate behavior.  Users are far more cavalier about their personal information than we expected.  For example, we have observed the following behaviors:

  • Although users are less forthcoming with their credit card information, they are surprisingly forthcoming about what one might otherwise think is private information, such as religious views.
  • People often share passwords across “high-value” (e.g., Amazon) and “low-value” (e.g., TripIt) sites.
  • Many users have revealed personal information (e.g., address, credit card information) to sites they rarely visit, or have visited only once.
  • Several users had weak passwords on their banking sites that could be cracked in less than one day (see the back-of-the-envelope sketch below).
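
As a rough check on what “less than one day” can mean, the sketch below estimates crack time from character-class entropy and an assumed offline guessing rate of ten billion guesses per second.  Real crackers use dictionaries and mangling rules, so this overestimates the strength of typical human-chosen passwords; the rate and the model are assumptions, not the methodology behind the observation above.

```python
# Back-of-the-envelope crack-time estimate from character-class entropy and an
# assumed offline guessing rate.  A deliberate overestimate of real strength:
# actual crackers exploit dictionaries and common patterns.
import math
import string

def naive_crack_time_days(password: str, guesses_per_second: float = 1e10) -> float:
    charset = 0
    charset += 26 if any(c in string.ascii_lowercase for c in password) else 0
    charset += 26 if any(c in string.ascii_uppercase for c in password) else 0
    charset += 10 if any(c in string.digits for c in password) else 0
    charset += 32 if any(c in string.punctuation for c in password) else 0
    entropy_bits = len(password) * math.log2(charset) if charset else 0.0
    return 2 ** entropy_bits / guesses_per_second / 86400

print(naive_crack_time_days("monkey12"))       # a small fraction of a day
print(naive_crack_time_days("K7#vTq9!mZ2p"))   # far more than a day under this model
```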

Are you one of these users who needs to clean up their online privacy footprint?  Download and install the Appu Chrome extension to find out! As Appu gains a larger user base, we will follow up with more discoveries about users’ behavior regarding their online privacy footprints.  We are actively developing a Firefox version of Appu; please join the appu-users mailing list if you want updates about version releases and news about support for other browsers.

(And yes, in case you are wondering, this project is being thoroughly reviewed by Georgia Tech’s Institutional Review Board.)

Show Me the Data

One of my friends recently pointed me to this post about network data. The author states that one of the things he will miss the most about working at Google is the access to the tremendous amount of data that the company collects.

Although I have not worked at Google and can only imagine the treasure trove their employees must have, I have also spent time with lots of sensitive data during my time at AT&T Research Labs.  At AT&T, we had—and researchers still presumably have—access to a font of data, ranging from router configurations to routing table dumps to traffic statistics of all kinds.  I found having direct access to this kind of data tremendously valuable: it allowed me to “get my hands dirty” and play with data as I explored interesting questions that might be hiding in the data itself.  During that summer, I developed a taste for working on real, operational problems.

Unfortunately, when one retreats to the ivory towers, one cannot bring the data along for the ride.  Sitting back at my desk at MIT, I realized there were a lot of problems with network configuration management and wanted to build tools to help network operators run their networks better.  One of these tools was the “router configuration checker” (rcc), which has been downloaded and used by hundreds of ISPs to check their routing configurations for various kinds of errors.  The road to developing this tool was tricky: it required knowing a lot about how network operators configure their networks, and more importantly direct access to network configurations on which to debug the tool.  I found myself in a catch-22 situation: I wanted to develop a tool that was useful for operators, but I needed operators to give me data to develop the tool in the first place.

My most useful mentor at this juncture was Randy Bush, a research-friendly operator who told me something along the following lines: Everyone wants data, but nobody knows what they’re going to do with it once they get it.  Help the operators solve a useful problem, and they will give you data.

This advice could not have been more sage.

I went to meetings of the North American Network Operators Group (NANOG) and talked about the basic checks I had managed to bootstrap into some scripts using data I had from MIT and a couple of other smaller networks (basically, enough to test that the tool worked on Cisco and Juniper configurations).  At NANOG, I met a lot of operators who seemed interested in the tool and were willing to help; often they would not provide me with their configurations, but they would run the tool for me and tell me the output (and whether or not the output made sense).  Guy Tal was another person to whom I owe a lot of gratitude for his patience in this regard.  Sometimes, I got lucky and even got a hold of some configurations to stare at.

Before I knew it, I had a tool that could run on large Internet Service Provider (ISP) configurations and give operators meaningful information about their networks, and hundreds of ISPs were using the tool.  And, I think that when I gave my job talk, people from other areas may not have understood the details of “BGP”, or “route oscillations”, or “route hijacks”, but they certainly understood that ISPs were actually using the tool.

We applied the same approach when we started working on spam filtering.  We wrote an initial paper that studied the network-level behavior of spammers with some data we were able to collect at a local “spam trap” on the MIT campus (more on that project in a later post).  The visibility of that work (and its unique approach, which spawned a lot of follow-on work) allowed us to connect with people in industry who were working on spam filtering, had real problems that needed solving, and had data (and, equally importantly, expertise) to help us think about the problems and solutions more clearly.

In these projects (as well as other more recent ones), I see a pattern in how one can get access to “real data”, even in academia.  Roughly, here is some advice:

  • Have a clear, practical problem or question in mind. Do not simply ask for data.  Everyone asks for data.  A much more select set is actually capable of doing something useful with it.  Demonstrate that you have given some thought to questions you want to answer, and think about whether anyone else might be interested in those questions.  Importantly, think about whether the person you are asking for data might be interested in what you have to offer.
  • Be prepared to work with imperfect data. You may not get exactly the data you would like.  For example, the router configurations or traffic traces might be partially anonymized.  You may only get metadata about email messages, as opposed to full payloads.  (And so on.)  Your initial reaction might be to think that all is lost without the “perfect dataset”.  This is rarely the case!  Think about how you can either adjust your model, or adapt your approach (or even the question itself) with imperfect data.
  • Be prepared to operate blindly. In many cases, operators (or other researchers) cannot give you raw data that they have access to; often, data may be sensitive, or protected by non-disclosure agreements.  However, these people can sometimes run analysis on the data for you, if you are nice to them, and if you write the analysis code in a way that they can easily run your scripts.
  • Bring something to the table. This goes back to Randy Bush’s point. If you make yourself useful to operators (or others with data), they will want to work with you—if you are asking an interesting question or providing something useful, they might be just as interested in the answers as you are.

There is much more to say about networking research and data.  Sometimes it is simply not possible to get the data one needs to solve interesting research problems (e.g., pricing data is very difficult to obtain).  Still, I think that, as networking researchers, we should first look for interesting problems and then look for data that can help us solve those problems; too often, we operate in reverse, like the drunk who looks for his keys under the lamppost because that is where the light is.  I’ll say more about this in a later post.