The Internet broke today

…at least in some countries. In the UK even gov.uk was down. Seems that content delivery networks are wonderful until they don’t work, then they’re close to a single point of failure.
Major internet outage ‘shows infrastructure needs urgent fixing’ | Internet | The Guardian

I’m old enough to remember when one of the features of the Internet was supposed to be its resilience, derived from the original ARPANET concept. Seems like a backwards step if our quest for faster response times has led us to a structure that’s become this fragile.

How did everyone else fare during that outage - did it have much effect in the US?

This reminded me of when Dyn were DDoS’d about 5 years ago; it had similar consequences.

It caught my attention when I couldn’t read the Australian news item posted elsewhere. I got a “Varnish cache” error. Then my Amazon page wouldn’t render correctly, and then neither would Reddit. At that point I checked the Down Detector site and by then they were publishing a red banner line saying it was “probably because of Fastly.” It was around 6am my time, so I knew the day wasn’t going to start well for some people. “Luckily” for me, I have a bizarre sleep schedule, so I could sleep past it and expect a return to normalcy by the time I got up.

Didn’t see much of an outage on my end. Certainly read about it though.

Interesting point, but I would argue that the original feature of resilience applied to the physical structure of the internet, rather than anything riding within it. The internet is not a company that provides services through the internet. In this case, a company using the internet made a mistake.

The resilience of the web we enjoy today is actually incredible and totally taken for granted. I can’t count the number of alerts I get throughout a year about major fiber cuts around the country caused by construction, accidents, weather, etc. that are nearly all instantly mitigated.


Ars has a “picture” with their take: :wink:


Tried to browse Reddit at 3am Pacific and no dice. First images wouldn’t load, then the whole thing went kablooie.

The infrastructure doesn’t need fixing, so much as the operators need thinner fingers. According to Forbes’ Davey Winder, it was a configuration error. That is human failure, not infrastructure failure.

It is like the many BGP incidents that regularly crop up (in fact, that was my first thought, when the news broke on Twitter).

I was lucky: the outage passed me by. None of the sites I go to regularly, apart from Forbes, was down.

I noticed US Amazon and gov.uk were down.

I suppose if any of those people impacted were that bothered they could engineer a secondary way of delivering their service independent of their chosen CDN, but that probably wouldn’t pass the cost-benefit test. At least they’ll be getting a service credit from Fastly :slight_smile:

In the last year, similar configuration errors have affected Microsoft’s Azure, AWS and Google clouds, among others, as well as ISPs and governments around the world routing “the whole Internet” through their networks.

The problem is that the kit is only redundant as long as you don’t fat-finger a configuration change that affects all the redundant nodes, e.g. changing a routing table that tells the traffic where to go for the shortest route to the different networks.

Fat-finger the change to go to all routers, instead of a specific one or a group, and the whole network is suddenly offline.

Fat-finger a new route and suddenly all the traffic goes through a specific node. This is what happened to one CDN or cloud service a few months back: they managed to fat-finger a routing change and suddenly, instead of all the traffic being evenly distributed across their network and all exit points, all their traffic was trying to get out over a single node.
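To make the point about scope concrete, here is a toy sketch in Python. It is not how Fastly or any real network works; the `Router` class, `push_config` and `network_is_up` are names I made up for illustration, and "up" is simplified to "at least one router still has a working config". It only shows why redundancy survives a bad change on one box but not the same change pushed everywhere.

```python
from dataclasses import dataclass


@dataclass
class Router:
    """Hypothetical router; config_ok is True while its config works."""
    name: str
    config_ok: bool = True


def push_config(routers, targets, config_ok):
    """Apply a config (good or bad) only to the routers named in targets."""
    for router in routers:
        if router.name in targets:
            router.config_ok = config_ok


def network_is_up(routers):
    """Toy redundancy rule: up while at least one router still works."""
    return any(router.config_ok for router in routers)


def make_fleet():
    """Build four healthy, redundant routers named r1..r4."""
    return [Router(f"r{i}") for i in range(1, 5)]


# Bad change scoped to a single router: the other three carry the traffic.
scoped = make_fleet()
push_config(scoped, {"r1"}, config_ok=False)

# The same bad change pushed to every router: redundancy no longer helps.
fleet_wide = make_fleet()
push_config(fleet_wide, {r.name for r in fleet_wide}, config_ok=False)

print("scoped change, network up:", network_is_up(scoped))
print("fleet-wide change, network up:", network_is_up(fleet_wide))
```

The only difference between the two cases is the set of targets, which is exactly the bit that is easy to get wrong.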

Edit: For the likes of UKGov or Amazon, they would still have to change their own configuration, to start using another CDN. It would break sessions, if they ran over multiple CDNs and clients were coming over arbitrary CDN connections.

Statement by Fastly, appropriately contrite, here:


Full credit to Fastly though; I wish more companies provided such a summary after their events, let alone so quickly.
