May 28, 2013
On March 27, 2013 at 6:20 a.m. UTC, the SeaMeWe-4 cable outage affected connectivity across the world. SeaMeWe-4 is currently the largest submarine cable connecting Europe and Asia. The Renesys blog recently covered the effect of this outage on various parts of Asia and Africa (Pakistan, Saudi Arabia, the UAE, etc.). In this post, we explore how the fiber cut affected connectivity from other parts of the world, as visible from the BISmark home router deployment. The credit for the data analysis in this blog post goes to Srikanth Sundaresan, one of Georgia Tech’s star Ph.D. students whose work on BISmark has garnered a number of awards.
The BISmark project has been deploying customized home gateways in home broadband access networks around the world for more than two years; we currently have more than 130 active home routers measuring the performance of access links in nearly 30 countries. The high-level goal of the project is to gather information from inside home networks to help users and ISPs better debug their home networks. Two years ago, we published the first paper using BISmark data in SIGCOMM. The paper explores the performance of broadband access networks around the United States and has many interesting findings:
- We showed how a technique called “interleaving” on DSL networks can introduce tens of milliseconds of additional latency on home networks.
- We explored how a user’s choice of equipment can introduce “bufferbloat” effects on home access links.
- We showed how technologies such as PowerBoost can also introduce sudden, dramatic increases in latency when interacting with buffering on the access link.
The image below shows the current deployment of BISmark. We have more than 80 routers in North America, nearly 20 in Southeast Asia, about 15 in the European Union, about 15 in South Africa, and about 10 in East Asia. You can explore the data from the deployment yourself on the Network Dashboard; all of the active measurements are available for download in raw XML format as they are collected.
Each BISmark router sits in a home broadband access network. The routers are NetGear WNDR 3700 and 3800s; we ship routers to anyone who is interested in participating. As an incentive for participating, you gain access to your own data on the network dashboard. We are also actively seeking researchers and developers; please contact us below if you are interested, and feel free to check out the project GitHub page.
Every BISmark router measures latency to the Google anycast DNS service and to 10 globally distributed Measurement Lab servers every 10 minutes. Those servers are located in Atlanta, Los Angeles, Amsterdam, Johannesburg, Nairobi, Tokyo, Sydney, New Delhi, and Rio de Janeiro.
Effects of the SMW4 Fiber Cut: A Case Study
We first explore the effects of the fiber cut on reachability from the active BISmark routers to each of the Measurement Lab destinations. At the time of the outage (6:20a UTC), the Measurement Lab server in Nairobi became completely unreachable for more than four hours. The Nairobi Measurement Lab server is hosted in AS 36914 (KENet, the Kenyan Education Network).
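An unreachability window like this one can be read directly off periodic probe results. The following is a minimal sketch, not the analysis code used for this post: the function name `unreachable_windows`, the `min_failures` parameter, and the probe data are all hypothetical, and the samples are synthetic (one probe every 10 minutes, with `None` standing in for a lost probe).

```python
from datetime import datetime, timedelta

def unreachable_windows(samples, min_failures=2):
    """Find runs of consecutive failed probes.

    samples: list of (timestamp, rtt_ms or None), sorted by time.
    Returns (start, end) pairs, where start is the first failed probe
    and end is the first successful probe after the run (or the last
    failed probe if the data ends while still unreachable).
    """
    windows = []
    run_start, run_len, last_failed = None, 0, None
    for ts, rtt in samples:
        if rtt is None:
            if run_start is None:
                run_start = ts
            run_len += 1
            last_failed = ts
        else:
            if run_start is not None and run_len >= min_failures:
                windows.append((run_start, ts))
            run_start, run_len = None, 0
    if run_start is not None and run_len >= min_failures:
        windows.append((run_start, last_failed))
    return windows

# Synthetic probes (an assumption, not BISmark data): every 10 minutes
# from 06:00 to 11:00 UTC, with probes lost between 06:20 and 10:34.
t0 = datetime(2013, 3, 27, 6, 0)
cut = datetime(2013, 3, 27, 6, 20)
restored = datetime(2013, 3, 27, 10, 34)
samples = []
for i in range(31):
    ts = t0 + timedelta(minutes=10 * i)
    rtt = None if cut <= ts < restored else 300.0
    samples.append((ts, rtt))

windows = unreachable_windows(samples)
```

Because probes run only every 10 minutes, the window boundaries from a single router are accurate to within one probe interval at best; this is one reason to cross-check against BGP data, as we do below.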
Connectivity was restored at 10:34a UTC. Interestingly, between 9a and 10a UTC, reachability from many of our other BISmark routers to all of the Measurement Lab destinations was affected. We have not yet explored which of the BISmark routers experienced these reachability problems, but, as we explore further below, this connectivity blip coincides with some connectivity being restored to Kenya via Safaricom, the backup ISP for the Measurement Lab server hosted in KENet. It is possible that other convergence events were also occurring at that time.
Analysis of the BGP routing table information from RouteViews shows that connectivity to AS 36914 was restored at 10:34a UTC. The following figure shows the latencies from all nodes to Nairobi before and after the outage. As soon as connectivity returns, the first set of latencies seems to be roughly the same as before, but latencies almost immediately increase to all destinations, except for a router situated in South Africa in AS 36937 (Neotel). This result suggests that Neotel may have better connectivity to destinations within Africa than some other ISPs, and that access ISPs who use Neotel for “transit” may see better performance and reliability to destinations within the continent. Because only the SEACOM cable was affected by the cut, not the West African Cable System (WACS) or EASSy cable, Neotel’s access to other fiber paths may have allowed its users to sustain better performance after the fiber cut.
This incident—and Neotel’s relative resilience—suggests the importance of exploring the effects of undersea cable connectivity in various countries in Africa and how such connectivity affects resilience. (In a future post, we will explore the effects of peering and ISP interconnectivity on the performance that users in this part of the world see.)
Internet Routing to KENet during the Outage
6:20a: The Fiber Cut. The reachability and performance effects caused by the SMW4 fiber cut raise the question of what was happening to routes to Kenya (and, in particular, KENet) at the time of the outage. We explore this in further detail below. The first graph below shows reachability to KENet (AS 36914, the large red dot) at 6:20:50 UTC, around which time the fiber cut occurred. The second plot shows the routes at 6:23:51 UTC; by 6:27:06 UTC, AS 36914 became completely unreachable.
9:05a: Connectivity is (partially) restored through a backup path. About two-and-a-half hours later, at 9:05:49 UTC, AS 36914 starts to come back online, and connectivity is restored within about one minute, although all Internet paths to this destination go through AS 33771 (Safaricom), which is most likely KENet’s backup (i.e., commercial, and hence more expensive) provider. This is an interesting example of BGP routing and backup connectivity in action: many ISPs such as KENet have primary and backup Internet providers, and paths only go through the backup provider (in this case, Safaricom) when the primary path fails.
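To make the primary/backup idea concrete, here is a toy sketch of one common way such an arrangement can be built: the destination announces its prefix to the backup provider with its own AS number prepended, so remote networks see a longer AS path through the backup and prefer the primary while it is available. We do not know how KENet actually configures its announcements, so this is an assumption for illustration only. The primary provider's AS number (64500, from the range reserved for documentation) and the `best_path` helper are hypothetical, and real BGP route selection has several more steps than AS-path length.

```python
KENET = 36914       # from the post
SAFARICOM = 33771   # from the post
PRIMARY = 64500     # hypothetical primary provider (documentation ASN)

def best_path(candidates):
    """Pick the route with the shortest AS path, or None if no route exists.

    This models only the AS-path-length step of BGP route selection.
    """
    if not candidates:
        return None
    return min(candidates, key=len)

# AS paths as seen by a remote network. The backup path is prepended
# (KENet's AS repeated) to make it less attractive.
via_primary = (PRIMARY, KENET)
via_backup = (SAFARICOM, KENET, KENET, KENET)

normal = best_path([via_primary, via_backup])   # primary wins
failover = best_path([via_backup])              # primary withdrawn
outage = best_path([])                          # no route at all
```

In this model, the backup route is selected only once the primary route has been withdrawn, and the destination is unreachable only when neither route is present, which matches the sequence of events visible in the routing data.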
Note that although connectivity to KENet was restored through Safaricom at around 9:06a UTC, none of the BISmark routers could reach the Measurement Lab server hosted in KENet through this backup path! This pathology suggests that the failover didn’t really work as planned, for some reason. Although this disconnection could result from poor Internet “peering” between Safaricom and the locations of our BISmark routers around the world, it is unlikely that bad peering would affect reachability to all of our destinations. Still, it is not clear why the connectivity through Safaricom was not sufficient to restore connectivity to at least some of the BISmark nodes. The connectivity issue we observed could be something mundane (e.g., Safaricom simply blocks ICMP “ping” packets), or it could be something much more profound.
It is also interesting to note that Internet routing took more than two hours to restore! Usually, we think of Internet routing as being dynamic, automatically reconverging when failures occur to find a new working path (assuming one exists). While BGP has never been known for being zippy, two-and-a-half hours seems excessive. It is perhaps more likely that some additional connectivity arrangements were being made behind the scenes; it might even be the case that KENet purchased additional backup connectivity (or made special arrangements) during those several hours when they were offline.
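The length of this gap follows directly from the two BGP timestamps quoted above (AS 36914 fully unreachable by 6:27:06 UTC, first routes via the backup path at 9:05:49 UTC). A quick check, using only those two times:

```python
from datetime import datetime

# Timestamps taken from the routing snapshots described in this post.
unreachable = datetime(2013, 3, 27, 6, 27, 6)
backup_restored = datetime(2013, 3, 27, 9, 5, 49)

gap = backup_restored - unreachable
print(gap)  # 2:38:43
```

That is a little over two and a half hours with no route to KENet at all, far longer than BGP convergence alone would normally explain.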
10:35a: Connectivity returns through the primary path. At around 10:34a UTC, routes to KENet begin reverting to the primary path, as can be seen in the left figure below. By 10:35a UTC, everything is “back to normal” as far as BGP routing is concerned, although, as we saw above, latencies remain high to most destinations for an additional eight hours. It is unclear what causes latencies to remain high after connectivity was restored, but this offers another important lesson: BGP connectivity does not equate to good performance through those BGP paths. This underscores the importance of using both BGP routing tables and a globally distributed performance measurement platform like BISmark to understand performance and connectivity issues around the times of outages.
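One simple way to quantify "reachable but degraded" is to compare the median latency in a window before the outage with the median in a window after routes return. The sketch below is illustrative only: the `latency_inflation` helper, the 1.25 threshold, and the RTT values are all made up for the example and are not BISmark measurements.

```python
from statistics import median

def latency_inflation(before_ms, after_ms):
    """Ratio of median RTT after restoration to median RTT before the outage."""
    return median(after_ms) / median(before_ms)

# Synthetic RTT samples in milliseconds (assumed values, not real data).
before = [310.0, 305.0, 320.0, 315.0]
after = [480.0, 510.0, 495.0, 505.0]

ratio = latency_inflation(before, after)   # 500.0 / 312.5 = 1.6
degraded = ratio > 1.25                    # hypothetical threshold
```

A route that is present in the BGP table but shows a ratio well above 1 is exactly the case the routing data alone cannot reveal. The median is used here rather than the mean so that a few outlier probes do not dominate the comparison.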
It’s worthwhile to reflect on some of the lessons from this incident; it teaches us about how Internet routing works (and doesn’t work), about the importance of backup paths, and about the importance of performing joint analysis of both routing information and active performance measurements from a variety of globally distributed locations. I’ve summarized a few of these below:
- Peering and interconnectivity in Africa haven’t yet come of age. It is clear from this incident that certain locations in Africa (although not all) are not particularly resilient to fiber cuts. The SMW4 fiber cut took KENet completely offline for several hours, and even after connectivity was “restored” several hours later, many locations still could not reach the destination through the backup path. Certain ISPs in Africa that are better connected (e.g., Neotel, and the Measurement Lab node hosted in TENET in Johannesburg) weathered the fiber cut much better than others, most likely because they have backup connectivity through WACS or EASSy. In a future post, we will explore performance issues in various parts of Africa that likely result from poor peering.
- Connectivity does not imply good performance. Even after connectivity was completely “restored” (at least according to BGP), latencies to Nairobi from most regions remained high for almost another eight hours. This disparity underscores the importance of not relying solely on BGP routing information to understand the quality of connectivity to and from various Internet destinations. Global deployments like BISmark are critical for getting a more complete picture of performance.
- “Dynamic routing” isn’t always dynamic. The ability of dynamic routing protocols to find a working backup path depends on the existence of those paths in the first place. The underlying physical connectivity must be there, and the business arrangements (peering) between ISPs must exist to allow those paths to exist (and function) when failures do occur. Something occurred on March 27, 2013 that exposed a glaring hole in the Internet’s ability to respond dynamically to a failure. It would be very interesting to learn what happened between 6:20a UTC and 9:05a UTC: exactly what resulted in connectivity being restored (via Safaricom), and why it took so long. Perhaps we need more sophisticated “what if” tools that help ISPs better evaluate their readiness for these types of events.
In future posts, we will continue to explore how BISmark can help expose pathologies that result from disconnections, outages, and other disruptions. Our ability to perform this type of analysis depends on the continued support of ISPs, users, and the broader community. We encourage you to contact us using the form below if you are interested in hosting a BISmark router in your access network. (You can also post public comments at the bottom of the page, below the contact form.)