Facebook suffered one of its largest outages on Monday, going down for over five hours. What happened?
Think about the last time you tracked a package that you ordered online. You might have seen a list of all the different places your package traveled through – service hubs, sorting facilities, distribution centers, and then finally your local post office for delivery.
The internet is a similar kind of network of networks. Border Gateway Protocol (BGP) is the protocol that networks use to tell each other which destinations they can reach, so your data can take an efficient route across the internet. Different providers along the way advertise and exchange routing information with each other. Beyond the public internet, some companies also use BGP to route traffic inside their own large data centers. Facebook runs its own BGP both inside its data centers and out on the internet – so that it can make speedy updates when its underlying infrastructure changes.
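To make the routing idea concrete, here's a minimal sketch of BGP-style route selection in Python. It's a toy model, not real router software: the prefixes, AS numbers, and provider names are illustrative, and real BGP uses a much longer decision process than "shortest path wins."

```python
# Toy illustration of BGP-style route selection (not real router software).
# Each neighbor advertises which destination prefixes it can reach and the
# chain of networks (the AS path) traffic would cross to get there; a common
# tie-breaker is to prefer the advertisement with the shortest AS path.
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Advertisement:
    prefix: str          # destination network, e.g. "157.240.0.0/16"
    as_path: list[int]   # autonomous systems the route passes through
    neighbor: str        # which peer advertised this route

def best_route(ads: list[Advertisement], prefix: str) -> Advertisement | None:
    """Pick a route for `prefix`, preferring the shortest AS path
    (a big simplification of BGP's real decision process)."""
    candidates = [ad for ad in ads if ad.prefix == prefix]
    if not candidates:
        return None  # nobody advertises this prefix, so it's unreachable
    return min(candidates, key=lambda ad: len(ad.as_path))

# Two providers advertise routes to the same (illustrative) prefix.
table = [
    Advertisement("157.240.0.0/16", [64500, 64501, 32934], "provider-a"),
    Advertisement("157.240.0.0/16", [64502, 32934], "provider-b"),
]
print(best_route(table, "157.240.0.0/16").neighbor)  # provider-b: shorter path
print(best_route(table, "192.0.2.0/24"))             # None: no advertisements left
```

The important detail for this story: if every advertisement for a prefix is withdrawn, there is simply nothing left to pick, and the destination disappears from the internet's point of view.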
So what happened with Facebook? A combination of bugs allowed a mistaken change to go through that misconfigured Facebook's data centers globally and took them offline. Facebook also runs its own DNS servers for the internet (you can think of DNS as an address book that maps names like facebook.com to addresses). Those DNS servers are configured to withdraw their BGP routes when they lose their connection to the data centers – since that's usually a sign of a broken network. That means Facebook ended up automatically removing every route to itself that had been advertised on the internet.
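Here's a rough sketch of what that kind of safety automation can look like. This is hypothetical – the hostnames and helper functions are made up, and it is not Facebook's actual code – but it shows the shape of the logic: if a DNS server can't reach the data centers behind it, it pulls its own BGP advertisement so the internet stops sending traffic to a server that can't answer.

```python
# Hypothetical sketch of "withdraw yourself when the network looks broken."
import socket
import time

DATA_CENTERS = ["dc1.internal.example", "dc2.internal.example"]  # placeholder hosts

def can_reach(host: str, port: int = 443, timeout: float = 2.0) -> bool:
    """Return True if we can open a TCP connection to a backend host."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def withdraw_bgp_routes() -> None:
    # Stand-in for telling the local BGP daemon to stop advertising our prefix.
    print("Withdrawing BGP advertisement: backends unreachable")

def advertise_bgp_routes() -> None:
    # Stand-in for re-announcing the prefix once the backends are healthy again.
    print("Re-advertising BGP routes")

def health_loop() -> None:
    advertised = True
    while True:
        healthy = any(can_reach(dc) for dc in DATA_CENTERS)
        if not healthy and advertised:
            withdraw_bgp_routes()   # the step that pulled Facebook's DNS off the internet
            advertised = False
        elif healthy and not advertised:
            advertise_bgp_routes()
        time.sleep(10)
```

Normally this is a sensible safeguard – a broken DNS server shouldn't keep attracting traffic. The problem on Monday was that every server hit the unhealthy branch at once.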
When internal networks are down, internal debugging can be extremely hard, since the debugging tools themselves run on that internal network! With so many employees working from home, engineers couldn't make any changes once the network went down.
Once the root cause was identified, Facebook had to carefully bring things back online without breaking them – imagine trying to turn on every device in your house at the same time.
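One common way to avoid that "everything at once" surge – shown here as a generic sketch, not Facebook's actual recovery procedure – is a staged bring-up: restart services in small waves, pausing between waves so power systems, caches, and databases can settle before the next batch comes online.

```python
# Generic illustration of a staged bring-up: restart in small batches
# with a pause in between, instead of turning everything on at once.
import time

def bring_up_in_waves(services, wave_size=2, pause_s=5.0):
    for i in range(0, len(services), wave_size):
        wave = services[i:i + wave_size]
        for name in wave:
            print(f"starting {name}")  # stand-in for the real restart command
        time.sleep(pause_s)            # let load stabilize before the next wave

bring_up_in_waves(["dns", "load-balancers", "cache", "databases", "web", "api"])
```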
Are there any lessons to be learned? For one, as we move to a more remote workforce, DevOps and Site Reliability Engineers will need to adapt their tools and processes to account for scenarios like this. Another lesson is the fragility of large-scale network configuration. It's difficult to test or sandbox changes like these, whose effects often can only be observed in production.
In complex systems, it's rare to have a single point of failure and extremely difficult to test for scenarios where everything seems to go wrong.
You can read Facebook's official response here.