Internet Relativism and the Hunt for Elusive “Ground Truth”

Networking and security research often relies on a notion of ground truth to evaluate the effectiveness of a solution.  “Ground truth” refers to a true underlying phenomenon that we would like to characterize, detect, or measure.  We often evaluate the effectiveness of a classifier, detector, or measurement technique by how well it reflects ground truth.

For example, an Internet link might have a certain upstream or downstream throughput; the effectiveness of a tool that measures throughput could thus be quantified in terms of how close its estimates of upstream and downstream throughput are to the true throughput of the underlying link.  Since there is a physical link with actual upstream or downstream throughput characteristics (and the properties of that link are either explicitly known or can be independently measured), measuring error with respect to ground truth makes sense.  In the case of analyzing routing configuration to predict routing behavior (or detect errors), static configuration analysis can characterize where traffic in the network will flow and whether the configuration will give rise to erroneous behavior; either the predictions correctly characterize the behavior of the real network, or they don’t.  A spam filter might classify an email sender as a legitimate sender or a spammer; again, either the sender is a spammer or it is a legitimate mail server.  In this case, comparing against ground truth is more difficult, since if we had a perfect characterization of spammers and legitimate senders, we would already have the perfect spam filter.  The solution in these kinds of cases is to compare against an independent label (e.g., a blacklist) and somehow argue that the proposed detection mechanism is better than the existing approach to labeling or classification (e.g., faster, earlier, more lightweight, etc.).

Problem: Lack of ground truth.  For some Internet measurement problems, the underlying phenomenon simply cannot be known—even via an independent labeling mechanism—either because the perpetrator of an action won’t reveal his or her true intention, or sometimes because there actually is no “one true answer”. Sometimes we want to characterize scenarios or phenomena where the ground truth proves elusive.  

Consider the following two problems:

  • Network neutrality.  The network neutrality debate centers around the question of whether Internet service providers should carry all traffic according to the same class of service, regardless of various properties such as what type of traffic it is (e.g., voice, video) or who is sending or receiving that traffic.
  • Filter bubbles.  Eli Pariser introduced the notion of a filter bubble in his book The Filter Bubble.  A filter bubble is the phenomenon whereby each Internet user sees different Internet content based on factors ranging from our demographic to our past search history to our stated preferences.  Briefly, each of us sees a different version of the Internet, based on a wide range of factors.

These two detection problems do not have a notion of ground truth that can be easily measured.  In the latter case, there is effectively no ground truth at all.

In the case of network neutrality, detection boils down to determining whether an ISP is providing preferential treatment to a certain class of applications or customers.  While ground truth certainly exists (i.e., either the ISP is discriminating against a certain class of traffic or it isn’t), discovering ground truth is incredibly challenging: ISPs may not reveal their policies concerning preferential treatment of different traffic flows, for example.

Similarly, in the case of filter bubbles, we want to determine whether a content provider or intermediary (e.g., search engine, news aggregator, social network feed) is manipulating content for particular groups of users (e.g., showing only certain news articles to Americans).  Again, there is a notion of ground truth—either the content is being manipulated or it isn’t—but the interesting aspect here is not so much whether content is being manipulated (we all know that it is), but rather what the extent of that manipulation is.  Characterizing the extent of manipulation is difficult, however, because personalization is so pervasive on the Internet: everyone effectively sees content that is tailored to their circumstances, and there is no notion of a baseline that reflects what a set of search results or a page of recommended products might look like before the contents were tailored for a particular user.  In many cases, personalization has been so ingrained in data mining and search that even the algorithm designers are unable to characterize what “ground truth” content (i.e., without manipulation) might look like.

Relativism: measuring how different perspectives give rise to inconsistencies.  In cases where ground truth is difficult to measure or impossible to know, we can still ask questions about consistency.  For example, in the case of network neutrality, we can ask whether different groups of users experience comparable performance.  In the case of filter bubbles, we can ask whether different groups of users see similar content.  When inconsistencies arise, we can then attempt to attribute a cause to these inconsistencies by controlling for all factors except for the factor we believe might be the underlying cause for the inconsistency.  One might call this Internet relativism, in a way: We concede that either there is no absolute truth, or that the absolute truth is so difficult to obtain that we might as well not try to know it.  Instead, we can explore how differences in perspective  or “input signals” (e.g., demographic, geography) give rise to different outcomes and try to determine which input differences triggered the inconsistency.  We have applied this technique to the design of two real-world systems that address these two respective problem areas.  In both of these problems, we would love to know the underlying intention of the ISP or information intermediary (i.e., “Is the performance problem I’m seeing a result of preferential treatment?”, “(How) is Google, Netflix, or Amazon manipulating my results based on my demographic?”).

  • NANO: Network Access Neutrality Observatory.  We developed NANO several years ago to characterize ISP discrimination for different classes of traffic flows.  In contrast to existing work in this area (e.g., Glasnost), which requires a hypothesis about the type of discrimination that is taking place, NANO operates without any a priori hypothesis about discrimination rules and simply looks for systematic deviation from “normal” behavior for a certain class of traffic (e.g., all traffic from a certain ISP, for a certain application, etc.).  The tricky aspect involved in this type of detection is that there is no notion of normal.  For example, ISP Y might also be performing a similar type of discrimination, so there is no firm ground truth against which to compare.  Ideally, what we’d like to ask is “What performance does this user see using ISP X vs. the performance they would see if they were not using ISP X?”  Unfortunately, there is no reasonable way to test the performance that a user would experience as a result of not using an ISP.  (This is in contrast to randomized treatment in clinical trials, where it makes sense to have a group of users who, say, are subject to a particular treatment or not.)  To address this problem, the best we could do to establish a baseline was to average the performance seen by all users from other ISPs and compare those statistics against the performance seen by a group of users for the ISP under test.
  • Bobble: Exposing inconsistent search results.  We recently developed Bobble to characterize the inconsistencies that exist in Web search results that users see, as a result of both personalization and geography.  Ideally, we would like to measure the extent of manipulation against some kind of baseline.  Unfortunately, however, the notion of a baseline is almost meaningless, since no Internet user is subject to such a baseline—even a user who has no search history may still see personalized results based on geography, time of day, device type, and other features, for example.  In this scenario, we established a baseline by comparing the search results of a signed-in user against a user with no search history, making our best attempt to hold all other factors constant.  We also performed the same experiment with users who were not signed in and had no search history, varying only geography.  Unlike NANO, in the case of Bobble, there is not even a notion of an “average” user; the best we can hope for are meaningful characterizations of inconsistencies.
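To make "characterizing inconsistencies" concrete, here is a minimal sketch of how two result lists for the same query might be compared.  It is not Bobble's actual metric: the function names, the choice of the Jaccard index, and the example URLs are all assumptions made purely for illustration.

```python
def jaccard(a, b):
    """Jaccard index of two result lists, treated as sets (order ignored)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty result pages are trivially identical
    return len(set_a & set_b) / len(set_a | set_b)


def rank_changes(a, b):
    """For each URL present in both lists, its rank in b minus its rank in a."""
    pos_b = {url: i for i, url in enumerate(b)}
    return {url: pos_b[url] - i for i, url in enumerate(a) if url in pos_b}


# Hypothetical results for one query, seen from two vantage points.
signed_in = ["a.example", "b.example", "c.example", "d.example"]
no_history = ["b.example", "a.example", "c.example", "e.example"]

print(jaccard(signed_in, no_history))       # 3 shared URLs out of 5 distinct: 0.6
print(rank_changes(signed_in, no_history))  # a.example moved down one, b.example up one
```

Neither number says which list is "correct"; both only quantify how far two perspectives diverge, which is all that can be measured when no baseline exists.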

Takeaways and general principles.  These two problems both involve an attempt to characterize an underlying phenomenon without any hope of observing “ground truth”.  In these cases, it seems that our best hope is to approximate a baseline and compare against that (as we did in NANO); failing that, we can at least characterize inconsistencies.  In any case, when looking for these inconsistencies, it is important to (1) enumerate all factors that could possibly introduce inconsistencies; and (2) hold those factors fixed, to the extent possible.  For example, in NANO, one can only compare a user against average performance for a group of users that have identical (or at least similar) characteristics for anything that could affect the outcome.  If, for example, browser type (or other features) might affect performance, then the performance of a user for an ISP “under test” must be compared against users with the same browser (or other features), with the ISP being the only differing feature that could possibly affect performance.  Similarly, in the case of Bobble, we must hold other factors like browser type and device type fixed when attempting to isolate the effects of geography or search history.  Enumerating all of these features that could introduce  inconsistencies is extremely challenging, and I am not aware of any good way to determine whether a list of such features is exhaustive.
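The idea of holding other factors fixed can be sketched as a stratified comparison: group measurements by every factor that might affect the outcome, and compare the ISP under test against other ISPs only within each group.  This is an illustrative sketch rather than NANO's implementation; the record fields (isp, browser, throughput_mbps) and the function name are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def stratified_gap(records, isp_under_test, confounders):
    """Per stratum, mean throughput for the ISP under test minus the mean for all other ISPs.

    A stratum is one combination of confounder values (e.g., one browser type).
    Strata that lack measurements on either side are skipped, because no
    like-for-like comparison is possible there.
    """
    test = defaultdict(list)
    baseline = defaultdict(list)
    for r in records:
        key = tuple(r[c] for c in confounders)
        group = test if r["isp"] == isp_under_test else baseline
        group[key].append(r["throughput_mbps"])
    return {key: mean(test[key]) - mean(baseline[key])
            for key in test if key in baseline}


# Hypothetical measurements.
records = [
    {"isp": "X", "browser": "firefox", "throughput_mbps": 4.0},
    {"isp": "X", "browser": "firefox", "throughput_mbps": 6.0},
    {"isp": "Y", "browser": "firefox", "throughput_mbps": 9.0},
    {"isp": "Z", "browser": "firefox", "throughput_mbps": 11.0},
    {"isp": "X", "browser": "chrome", "throughput_mbps": 8.0},
    {"isp": "Y", "browser": "chrome", "throughput_mbps": 8.0},
    {"isp": "X", "browser": "safari", "throughput_mbps": 3.0},  # no baseline exists
]

print(stratified_gap(records, "X", ["browser"]))
# firefox: -5.0, chrome: 0.0; safari is omitted
```

The sketch is only as good as the list of confounders passed in: any factor left out of that list is silently averaged over, which is exactly the exhaustiveness problem described above.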

I believe networking and security researchers will continue to encounter phenomena that they would like to measure, but where the nature of the underlying phenomenon cannot be known with certainty.  I am curious as to whether others have encountered problems that call for Internet relativism, and whether it may be time to develop sound experimental methods to characterize Internet relativism, rather than simply blindly clamoring for “ground truth” when none may even exist.


Could We Ever See An Internet “Kill Switch”?

This past week I was interviewed on the evening news about Egypt’s decision to cut itself off from the Internet, and whether we might ever see something similar in the United States, in light of recent proposed legislation on a “kill switch” for the Internet.  I decided to elaborate on my thoughts from this interview.

Internet censorship is more pervasive than most people realize; it takes place in nearly 60 countries.  Last week, however, we witnessed an unprecedented event, whereby Egypt shut down Internet connectivity entirely.  In Egypt, all Internet traffic passes through only five Internet service providers (ISPs), so a government can shut off Internet traffic by controlling just a few Internet service providers.  An Internet “kill switch” is really a metaphor: four of these five Egyptian ISPs went offline one by one, with the large national incumbent ISP, Telecom Egypt, leading the way, and three other ISPs following its lead over about 15 minutes.  Four days later, the fifth ISP, which hosts many of the Egyptian financial institutions, was shut down.  Each of these ISPs has what is known as a “border router”, a device that forwards all traffic between its users inside Egypt and the rest of the world.  In the same way that network operators in this country can install filters to stop traffic from certain unwanted locations (e.g., from spammers), the Egyptian operators likely installed a filter to drop all Internet traffic between Egypt and the rest of the world.

This action raises many questions.  In light of recently proposed “kill switch” legislation (see page 76), many people wonder whether something similar could happen here in the United States. This outcome is unlikely.  First, the United States has more ISPs and connection points to the rest of the world.  Cutting us off would require control over many more Internet service providers and points of entry. Even if the government could order all ISPs to cut ties with the rest of the world, we would not feel the same impact: Because many of the sites and services that citizens want to access (such as Facebook and Twitter) actually host their services within our borders, cutting off the United States from the rest of the world might have a larger impact on the rest of the world than it would on us.

Another question is whether Internet access is a human right as much as water, electricity, and food, and, if we think it is, how we might guarantee that citizens have free and open access to information in the face of repressive or authoritarian governments.  There has been much work on tools for circumventing censorship, but governments eventually block these tools.  A complementary strategy might be to somehow entangle the traffic of ordinary users with traffic associated with critical infrastructure and activities.  The OECD released an estimate that Egypt lost as much as $90 million as a result of the Internet shutdown.  Tellingly, Egypt shut down the ISP hosting the Egyptian stock exchange for only one day instead of five.  The more everyone’s Internet traffic is intertwined with the traffic that is critical to a country’s revenue and operations, the more difficult it is for a government to cut off Internet access to its citizens without crippling its own operations.

Finally, while having no Internet access whatsoever seems dire, it is worth asking whether there might be worse scenarios.  While a complete shutdown of the Internet is certainly inconvenient and costly, a more competent government or organization might use the Internet to persuade and control its citizens, perhaps by sending propaganda or misinformation through services like Twitter and Facebook.  We did see small-scale instances of this, where Egypt forced a large cellular provider, Vodafone, to spread propaganda through text messages.  A far more disturbing scenario may occur when a government harnesses the Internet to spread misinformation or influence public opinion.  Unlike a complete shutdown, such manipulation is more subtle: since it doesn’t disrupt information exchange, the average user may not even notice it.  In fact, it is likely that this practice may already be occurring, perhaps even here at home.

Software-Defined Networking and The New Internet

Tonight, I am sitting on a panel sponsored by NSF and Discover Magazine about “The New Internet”.  The panel has four panelists who will be discussing their thoughts on the future of the Internet.  Some of the questions we have been asked to answer involve predictions about what will happen in the future.  Predictions are a tall order; as Yogi Berra said: “It is hard to make predictions, especially about the future.”

Predictions aside, I think one of the most exciting things about this panel is that we are having this discussion at all.  Not even ten years ago, Internet researchers were bemoaning the “ossification” of the Internet.  As the Internet continues to mature and expand, the opportunities and challenges seem limitless.  More than a billion people around the world now have Internet access, and that number is projected to at least double in the next 10 years. The Internet is seeing increasing penetration in various resource-challenged environments, both in this country and abroad.  This changing landscape presents tremendous opportunities for innovation.   The challenge, then, is developing a platform on which this innovation can occur.  Along these lines, a multicampus collaboration is pursuing a future Internet architecture that proposes to architect the network to make it easier for researchers and practitioners to introduce new, disruptive technologies on the Internet.  The “framework for innovation” that is proposed in the work rests on a newly emerging technology called software-defined networking.

Software-defined networking. Network devices effectively have two aspects: the control plane (in some sense, the “brain” for the network, or the protocols that make all of the decisions about where traffic should go), and the data plane (the set of functions that actually forward packets).  Part of the idea behind software-defined networking is to run the network’s control plane in software, on commodity servers that are separate from the network devices themselves.  This notion has roots in a system called the Routing Control Platform, which we worked on about five years ago and which now operates in production at AT&T.  More recently, it has gained more widespread adoption in the form of the OpenFlow switch specification.  Software-defined networking is now coming of age in the NOX platform, an open-source OpenFlow controller that allows designers to write network control software in high-level languages like Python. A second aspect of software-defined networking is to make the data plane itself more programmable, for example, by engineering the network data plane to run on programmable hardware.  People are trying to design data planes that are more programmable with FPGAs (see our SIGCOMM paper on SwitchBlade), with GPUs (see the PacketShader work), and also with clusters of servers (see the RouteBricks project).
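The split between control plane and data plane can be illustrated with a toy model.  The sketch below does not use the OpenFlow protocol or the NOX API; the Switch and Controller classes, their methods, and the routing table format are all invented for illustration.  The switch only looks up packets in a flow table, while a separate controller decides what goes into that table.

```python
class Switch:
    """Data plane: forwards packets by table lookup and makes no decisions itself."""

    def __init__(self, name):
        self.name = name
        self.flow_table = {}  # destination address -> output port

    def install_rule(self, dst, out_port):
        self.flow_table[dst] = out_port

    def forward(self, packet):
        """Return the output port for the packet, or None on a table miss."""
        return self.flow_table.get(packet["dst"])


class Controller:
    """Control plane: software running apart from the switches that decides where traffic goes."""

    def __init__(self, routes):
        self.routes = routes  # switch name -> {destination: output port}

    def handle_miss(self, switch, packet):
        """Decide how to handle an unmatched packet and push a rule to the switch."""
        out_port = self.routes.get(switch.name, {}).get(packet["dst"])
        if out_port is not None:
            switch.install_rule(packet["dst"], out_port)
        return out_port


def send(switch, controller, packet):
    """Forward via the flow table; consult the controller only on a miss."""
    port = switch.forward(packet)
    if port is None:
        port = controller.handle_miss(switch, packet)
    return port


s1 = Switch("s1")
ctrl = Controller({"s1": {"10.0.0.2": 2}})
print(send(s1, ctrl, {"dst": "10.0.0.2"}))  # table miss: controller installs a rule, returns 2
print(s1.flow_table)                        # the rule now lives in the switch
```

Changing network behavior in this model means changing only the Controller, which is ordinary software; the switches stay simple and unchanged.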

This paradigm is reshaping how we do computer networking research.  Five years ago, vendors of proprietary networking devices essentially “held the keys” to innovation, because networking devices—and their functions—were closed and proprietary.  Now a software program can control the behavior not only of  individual networking devices but also of entire networks.  Essentially, we are now at the point where we can control very large networks of devices with a single piece of software.

Thoughts on the New Internet. The questions asked of the panelists are understandably a bit broad. I’ve decided to take a crack at these answers in the context of software-defined networking.

1. What do you see happening in computer networking and security in the next five to ten years? We are already beginning to see several developments that will continue to take shape over the next ten years. One trend is the movement of content and services to the “cloud”. We are increasingly using services that are not on our desktops but actually run in large datacenters alongside many other services.  This shift creates many opportunities: we can rely on service providers to maintain software and services that once required dedicated system and network administration.  But, there are also many associated challenges.  First, determining how to help network operators optimize both the cost and performance of these services is difficult; we are working on technologies and algorithms to help network operators better control how users reach services running in the cloud to help them better manage the cost of running these services while still providing adequate performance to the users of these services. A second challenge relates to security: as an increasing number of services move to the cloud, we must develop techniques to make certain that services running in the cloud cannot be compromised and that the data that is stored in the cloud is safe.

Another important trend in network security is the growing importance of controlling where data goes and tracking where it has been; as networks proliferate, it becomes increasingly easy to move data from place to place, sometimes to places where it should not go.  There have been several high-profile cases of “data leaks”, including a former Goldman Sachs employee who was caught copying sensitive data to his hedge fund.  Issues of data-leak prevention and compliance (which involves being able to verify that data did not leak to a certain portion of the network) are becoming much more important as more sensitive data moves to the Internet, and to the cloud.

Software-defined networking is allowing us to develop new technologies to solve both of these problems. In our work on Transit Portal, we have used software routers to give cloud service providers much more fine-grained control over traffic to cloud services. We have also developed new technology based on software-defined networking to help stop data leaks at the network layer.

2. What is the biggest threat to everyday users in terms of computer security? Two of the biggest threats to everyday users in terms of computer security are the migration of data and services to the cloud and the proliferation of well-provisioned edge networks (e.g., the buildout of broadband connections to home networks).  The movement of data to the cloud offers many conveniences, but it also presents potentially serious privacy risks.  As services ranging from email to spreadsheets to social networking move to the cloud, we must develop ways to gain more assurance over who is allowed to have access to our data.  Another important challenge we will face with regard to computer security is the proliferation of well-provisioned edge networks. The threat of botnets that mount attacks ranging from spam to phishing to denial-of-service will become even more acute as home networks (which are, today, essentially unmanaged) proliferate. Attackers look for well-connected hosts, and as connectivity to homes improves and as the network “edge” expands, mechanisms to secure the edge of the network will also become more important.

3. What can we do via the Internet in the future that we can’t do now? The possibilities are limitless.  You could probably imagine that anything you are doing in the real world now might take place online in the future.  We are even seeing the proliferation of entirely separate virtual worlds, and the blending of the virtual world with the physical world, in areas such as augmented reality.  Pervasive, ubiquitous computing and the emergence of cloud-based data services make it easier to design, build, and deploy services that aggregate large quantities of data.  As everything we do moves online, everything we do will also be stored somewhere.  This trend poses privacy challenges, but, if we can surmount those challenges, there may also be significant benefits, if we can develop ways to efficiently aggregate, sort, search, analyze and present the growing volumes of data.

The Economist had a recent article that suggested that the next billion people who come onto the Internet will do so via mobile phone; this changing mode of operation will very likely give rise to completely new ways of communicating and interacting online.  For example, rural farmers are now getting information about farming techniques online; services such as Twitter are affecting political dynamics, and may even be used to help defeat censorship.

Future capabilities are especially difficult to predict, and I think networking researchers have not had the best track record in predicting future capabilities.  Many of the exciting new Internet applications have actually come from industry, both from large companies and from startups.  Networking research has been most successful at developing platforms on which these new applications can run, and ongoing research suggests that we will continue to see big successes in that area.  I think software-defined networking will make it easier to evolve these platforms as new applications develop and new needs emerge.

4. What are the big challenges facing the future of the Internet? One of the biggest challenges facing the future of the Internet is that we don’t really yet have a good understanding of how to make it usable, manageable, and secure.  We need to understand these aspects of the Internet, if for no other reason than we are becoming increasingly dependent on it.  As Mark Weiser said, “The most profound technologies are those that disappear.”  Our cars have complex networks inside of them that we don’t need to understand in order to drive them.  We don’t need to understand Maxwell’s equations to plug in a toaster.  Yet, to configure a home network, we still need to understand arcana such as “SSID”, “MAC Address”, and “traceroute”.  We must figure out how to make these technologies disappear, at least from the perspective of the everyday user.  Part of this involves providing more visibility to network users about the performance of their networks, in ways that they can understand.  We are working with SamKnows and the FCC on developing techniques to improve user visibility into the performance of their access networks, for example.  Software-defined networking probably has a role to play here, as well: imagine, for example, “outsourcing” some of the management of your home network to a third-party service that could help you troubleshoot and secure your network.  We have begun to explore how software-defined networking could make this possible (our recent HomeNets paper presents one possible approach).  Finally, I don’t know if it’s a challenge per se, but another significant question we face is what will happen to online discourse and communication as more countries come online; tens of countries around the world implement some form of surveillance or censorship, and the technologies that we develop will continue to shape this debate.

5. What is it going to take to achieve these new frontiers? The foremost requirement is an underlying substrate that allows us to easily and rapidly innovate and frees us from the constraints of deployed infrastructure.  One of the lessons from the Internet thus far is that we are extraordinarily bad at predicting what will come next.  Therefore, the most important thing we can do is to design the infrastructure so that it is evolvable.

I recently read a debate in Communications of the ACM concerning whether innovation on the Internet should happen in an incremental, evolutionary way or whether new designs must come about in a “clean slate” fashion.  But, I don’t think these philosophies are necessarily contradictory at all: we should be approaching problems with a “clean slate” mentality; we should not constrain the way we think about solutions simply based on what technology is deployed today. On the other hand, we must also figure out how to deploy whatever solutions we devise in the context of real, existing, deployed infrastructure.  I think software-defined networking may effectively resolve this debate for good: clean-slate, disruptive innovation can occur in the context of existing infrastructure, as long as the infrastructure is designed to enable evolution.  Software-defined networking makes this evolution possible.

Internet Censorship: Then and Now

I began working on Internet censorship nearly ten years ago, when Professors Hari Balakrishnan and David Karger talked about users who were behind the “Great Firewall of China” and their need to get more ready access to information.  In this post, I’ll talk about the state of censorship and circumvention back then, how the landscape has changed, the lessons I have learned along the way, and my initial thoughts on future research in this area.

Censorship Then

Ten years ago, Internet use was exploding in the United States, so it was initially somewhat hard for me to comprehend that censorship and surveillance were taking place in other parts of the world, let alone what a pervasive problem censorship would become.  Intuition would suggest that the spread of Internet access would provide citizens with more access to information, not less.  In practice, however, the opposite can be true: the Internet gives a government a finite and fixed set of points from which it can monitor or restrict access.  The Berkman Center has a web site where they report on the complexity of the internal networks within a variety of countries.  Essentially, they are comparing the complexity of ISP interconnections within a number of countries: the more “rich” these Internet connections, the more difficult it is for a country to restrict, monitor, or block content.  Most remarkable are the ISP structures of countries like China, where most ISPs connect through a single backbone network (presumably where the blocking takes place); compared to Nigeria, for example, the Chinese network is much more like a hub and spoke, with all regional ISPs connecting through the ChinaNet Backbone (which is the parent of nearly 2/3 of all of the country’s IP address space).

Nearly ten years ago, the Berkman Center published a nice report about the state of Internet censorship in China, exposing the extent of censorship in China and the determination of the government to develop more refined and sophisticated censorship techniques.  In response, people have developed techniques to try to circumvent censorship techniques.  Conceptually, every circumvention system works roughly as shown in this picture:

The helper has access to content outside the censorship firewall and can communicate with Alice, who is behind the firewall.  The helper’s job is to allow Alice and Bob to exchange content.  In practice, this helper might be a Web proxy (e.g., Anonymizer), a network of proxies (e.g., Tor), or, as we will see below, an intermediate drop site (e.g., Collage).

In response to ongoing censorship efforts at the time, we developed Infranet, a system to circumvent censorship firewalls.  The state-of-the-art circumvention tools at the time (e.g., Anonymizer) were essentially glorified Web proxies: a user in a censored regime would connect to a cooperating proxy outside of the firewall, which would, in turn, fetch content for the user and return that content over an encrypted channel.  However, censors could discover and block such proxies, and simply connecting to a proxy like this could raise suspicion.  In other words, existing proxy-based systems lacked two important properties:

  • Robustness. The mechanism that citizens use to circumvent censorship should be robust to the censor’s attempt to disrupt, mangle, or block the communication entirely.  Most existing systems (even widely used anonymization tools like Tor) are not inherently robust, because censors can block entry and exit nodes.
  • Deniability. Users of an anti-censorship system could be subject to extreme sanctions or repercussions.  For example, last year, a Chinese blogger was stabbed; many believe the outspoken nature of his blog to have provoked the violence.  Due to the consequences of violating censorship regulations, users in countries such as China even practice what is known as “self-censorship”: pre-emptively avoiding the exchange of content that might be incriminating or otherwise subject to censorship.  Therefore, any anti-censorship system must also be deniable: that is, the users of the system must be able to deny that they were even using the system in the first place.  Achieving this goal is more difficult on the Internet than with certain communications media (e.g., radio, television), and most existing tools for circumventing censorship or providing anonymity do not achieve deniability, either.

Infranet relies on a covert channel between the user behind a firewall and the proxy outside of the firewall.  The main idea is to allow a user to “cloak” a request for some censored Web site in other seemingly innocuous Web traffic.  In the case of Infranet, the proxy outside of the firewall hosted a Web server itself.  A user would issue requests for content on that Web site, but the proxy would interpret that sequence of requests as a coded message that was actually requesting some other censored content.  Despite its improvements over existing technology, Infranet did not gain widespread adoption, for (I think) two reasons:

  1. Simple schemes worked. When we talked to Voice of America about the tool, they said that most people were happy with simple proxy-based schemes; of course, the proxies had to move continually, but by the time the censors found out the new locations of the proxies and managed to block access to them, the proxies had already moved again.  Infranet was the circumvention equivalent of pounding a thumbtack with a sledgehammer.
  2. It required too much effort. Most censorship or anonymization tools require “helpers” outside of the censorship firewall that the censored users can communicate with.  For example, someone might need to set up a machine that runs a secure Web proxy.  Running Infranet required philanthropic users to run an Apache Web server, patch it with special software, and then face the prospect that their legitimate content hosted on the site might be blocked as a result of trying to help.  All of this seems like too much to ask.
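
To make the cloaking idea described above concrete, here is a toy sketch of how a sequence of innocuous-looking page requests can carry a hidden message.  It is not Infranet’s actual encoding: the page list, the two-bits-per-request mapping, and the function names are all assumptions made for illustration.

```python
from typing import List

# Hypothetical innocuous pages served by the cooperating Web server.
# Four pages means each request can carry two bits of hidden data.
PAGES = ["/index.html", "/news.html", "/photos.html", "/about.html"]


def encode(message: bytes) -> List[str]:
    """Turn each byte into four page requests, two bits per request."""
    requests = []
    for byte in message:
        for shift in (6, 4, 2, 0):  # most significant bit pair first
            requests.append(PAGES[(byte >> shift) & 0b11])
    return requests


def decode(requests: List[str]) -> bytes:
    """Recover the hidden bytes from the observed request sequence."""
    out = bytearray()
    for i in range(0, len(requests), 4):
        byte = 0
        for path in requests[i:i + 4]:
            byte = (byte << 2) | PAGES.index(path)
        out.append(byte)
    return bytes(out)
```

A real system has to do much more than this, because a fixed mapping produces request patterns that look nothing like normal browsing and so offers little deniability; the sketch shows only how the choice of which page to request can itself be the message.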

The lack of adoption was frustrating, and it seemed difficult to have real, measurable impact.  The research problems also seemed fuzzy, ill-defined, and unsolvable.

Censorship Now

Two important developments have occurred since that time, however, which give me more hope that this topic area has both interesting research questions and the potential for impact.

On the downside, censorship is becoming more pervasive. Many countries around the world have gone to remarkable lengths to restrict access to content on the Internet.   According to the OpenNet Initiative, twelve countries around the world have implemented some pervasive form of censorship.  Internet censorship has also played a significant role in political events, such as the Iranian elections.  A Freedom House report last year also examined the state of censorship, and found that censorship is now prevalent in nearly 60 countries around the world.  Internet censorship also matters more as more and more people use the Internet to communicate.

On the other hand, building circumvention tools is easier. One of the major problems with Infranet was that it required philanthropic individuals to host dedicated infrastructure.  Over the past ten years, however, “Web 2.0” has made it much easier for the average user to publish content on the Internet.  Users no longer have to maintain their own Web servers to host photos, videos, etc.  It occurred to me, then, that censorship circumvention technologies could also ride the Web 2.0 wave, using infrastructure in the cloud as the foundation for hiding information and building covert channels.  Sites such as Flickr that host “user generated content” appeared to be the perfect place to create “drop sites” for users to hide and exchange censored content.

Collage. Based on these observations, we designed Collage, which allows message senders to hide messages in photos and tweets and upload them to user-generated content sites such as Flickr and Twitter.  Its design has several advantages.  First, it does not require users to set up fixed infrastructure (e.g., Web servers).  Second, it uses erasure coding to “spread” any single message across multiple drop sites, making the system more robust to blocking than a proxy-based system.  Collage appeared at the USENIX Security Symposium last Friday (paper here) and has appeared in the press recently.  Time will tell whether this tool sees more widespread adoption.
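
To illustrate why spreading a message across drop sites helps with robustness, here is a minimal sketch of the simplest possible erasure code: k data chunks plus one XOR parity chunk, so that the message survives the loss of any one share.  This is not the code that Collage uses; the single-parity scheme, the 4-byte length prefix, and the function names are assumptions chosen to keep the example short.

```python
from functools import reduce
from typing import List, Optional


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def split_with_parity(message: bytes, k: int) -> List[bytes]:
    """Split a message into k equal-size chunks plus one XOR parity chunk."""
    # Prefix the length so that zero padding can be stripped on recovery.
    framed = len(message).to_bytes(4, "big") + message
    size = -(-len(framed) // k)  # ceiling division
    framed = framed.ljust(size * k, b"\0")
    chunks = [framed[i * size:(i + 1) * size] for i in range(k)]
    return chunks + [reduce(xor_bytes, chunks)]


def recover(shares: List[Optional[bytes]]) -> bytes:
    """Rebuild the message when at most one of the k+1 shares is None."""
    missing = [i for i, share in enumerate(shares) if share is None]
    if len(missing) > 1:
        raise ValueError("a single-parity code tolerates only one lost share")
    shares = list(shares)
    if missing:
        present = [share for share in shares if share is not None]
        # The XOR of all surviving shares equals the lost share.
        shares[missing[0]] = reduce(xor_bytes, present)
    framed = b"".join(shares[:-1])  # drop the parity chunk
    length = int.from_bytes(framed[:4], "big")
    return framed[4:4 + length]
```

Each share would then be hidden in a different photo or tweet; if a censor blocks or removes any one of them, the receiver can still call recover on the rest.  A practical system would want a code that tolerates more than one loss, which this sketch does not.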

Lessons and Looking Forward

My experience with research in Internet censorship taught me an important lesson for research in general: continually reconsider old problems. An old problem that was once uninteresting or unsolvable might become tractable because of other, seemingly unrelated developments.  In the case of Collage, the advent of Web 2.0 allowed us to significantly advance the state of the art over a system like Infranet.  It is worth repeatedly asking yourself what bearing a particular development might have on any other problem, even if the two areas seem unrelated.  You might find the right-sized hammer for your nail in the most unlikely of places.

We still don’t understand very much about Internet censorship.  We are still trying to understand its extent.  We have even less understanding of how various circumvention technologies work in practice.  It’s even harder to try to “measure” the level of deniability or robustness that a censorship circumvention tool might provide.  Debugging is also difficult: When a certain circumvention technology fails, is the failure a bug, or a direct consequence of censorship?  Finally, getting the software into the hands of the people who need it and helping them get set up (“bootstrapping”) remains a challenging problem, particularly considering that any information that a normal user could get is also accessible to a censor.  Given this wide array of open questions—ranging from theory to practice, and from technology to policy—I believe we may be at the dawn of a new and exciting research area.