Some Tech 101 (maybe 201) about How Google (and others) Go Offline
A few days ago, Google went offline. You may have missed it. Jennifer Rexford at Princeton pointed folks to this article about the event. Better yet, the article explains how the Internet works (well, the routing part, which was the issue) and how the problem was fixed. One thing that jumped out at me is that humans, yes HUMANS!, are still a big part of the system, and that trust, or maybe a Social Life of Information, plays a big role alongside the hardware and software. One of the people who identified the source of the issue called someone they knew at the source (a phone call, not email; Paul Ohm and Mike Madison have noted my preference for phones). I post the details below because I think they show how the system works:
The solution was to get Moratel to stop announcing the routes they shouldn’t be. A large part of being a network engineer, especially working at a large network like CloudFlare’s, is having relationships with other network engineers around the world. When I figured out the problem, I contacted a colleague at Moratel to let him know what was going on. He was able to fix the problem at around 2:50 UTC / 6:50pm PST. Around 3 minutes later, routing returned to normal and Google’s services came back online.
Looking at peering maps, I’d estimate the outage impacted around 3–5% of the Internet’s population. The heaviest impact will have been felt in Hong Kong, where PCCW is the incumbent provider. If you were in the area and unable to reach Google’s services around that time, now you know why.
Building a Better Internet
This all is a reminder about how the Internet is a system built on trust. Today’s incident shows that, even if you’re as big as Google, factors outside of your direct control can impact the ability of your customers to get to your site so it’s important to have a network engineering team that is watching routes and managing your connectivity around the clock. CloudFlare works every day to ensure our customers get the optimal possible routes. We look out for all the websites on our network to ensure that their traffic is always delivered as fast as possible. Just another day in our ongoing efforts to #savetheweb.
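For readers who want a sense of what "watching routes" might look like in code, here is a minimal sketch in Python. It is not CloudFlare's or Google's tooling; the AS numbers, the prefix, and the names `Route`, `collapse_prepends`, and `is_suspicious` are all made up for illustration (the numbers come from ranges reserved for documentation). The idea is simply to flag a route whose path ends at the wrong network, or that reaches the right network through a neighbor you did not expect.

```python
from dataclasses import dataclass
from typing import Tuple

# Hypothetical values for illustration only: AS numbers from the range
# reserved for documentation (RFC 5398) and a documentation prefix (RFC 5737).
ORIGIN_ASN = 64496                  # the network that should originate the prefix
EXPECTED_UPSTREAMS = {64497, 64498} # neighbors allowed to sit next to the origin


@dataclass(frozen=True)
class Route:
    prefix: str
    as_path: Tuple[int, ...]  # ordered from the observer toward the origin


def collapse_prepends(as_path):
    """Drop consecutive duplicate AS numbers (AS-path prepending)."""
    collapsed = []
    for asn in as_path:
        if not collapsed or collapsed[-1] != asn:
            collapsed.append(asn)
    return collapsed


def is_suspicious(route, origin=ORIGIN_ASN, upstreams=EXPECTED_UPSTREAMS):
    """Return True if the route's path looks wrong under our assumptions."""
    path = collapse_prepends(route.as_path)
    if not path or path[-1] != origin:
        return True   # empty path, or the prefix is originated by someone else
    if len(path) == 1:
        return False  # heard directly from the origin itself
    # The AS adjacent to the origin should be one we expect to carry the route.
    return path[-2] not in upstreams


if __name__ == "__main__":
    observed = [
        Route("192.0.2.0/24", (64500, 64497, 64496)),  # expected upstream
        Route("192.0.2.0/24", (64500, 64499, 64496)),  # unexpected neighbor
    ]
    for route in observed:
        status = "SUSPICIOUS" if is_suspicious(route) else "ok"
        print(route.prefix, route.as_path, status)
```

A check like this only raises a flag. As the story above shows, someone still has to pick up the phone and get the offending network to withdraw the route.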