It’s Not The Next Outage

(This column is posted at www.StevenSavage.com, Steve’s Tumblr, and Pillowfort. Find out more at my newsletter, and all my social media at my linktr.ee)

So the CrowdStrike Outage of 2024 happened. Actually, let me clarify, the CrowdStrike Outage of July 2024. I might as well be clear because that was a doozy and it showed some wide-ranging system instabilities. Also, considering it was such a disaster, maybe there’ll be another.

If you don’t know what I’m talking about, an update to some security software bricked a lot of Windows machines in a disaster that shouldn’t have happened. If “security software shut down systems” sounds bad, yes it was!

If “security disaster happened” sounds familiar AND you work in IT, AND your friends are nerds and/or work in IT, you know MY experience. I spent most of that Friday quietly losing my mind.

Of course there are questions of “how do we avoid the next outage” which is sort of sad, because you’d kind of like there not to be one, or at least not one as widespread. But I don’t think that’s quite the issue. Preparing for the next Giant Oopsie misses two things.

First, this exposed just how vulnerable systems are, and I’m worried about intentional attacks. We saw in real time how a software update could destroy systems. We saw how people did, or didn’t, recover. We saw where vulnerabilities might be. We wondered what would have happened had this been during another crisis: hurricane, terrorist attack, etc.

CrowdStrike was a mix of blueprint, roadmap, and test run for how to screw up IT systems worldwide. This is what you get by accident, meaning intentional attacks are now much easier to pull off effectively. We need to worry about intention.

Imagine a CrowdStrike-like outage but more destructive, not just an issue that can in theory be fixed by booting 15 times. Something designed to not be recoverable, an IT WMD.

Secondly, we’ve just seen that many major systems are just plain vulnerable, period. Everyone is on Windows, a lot of people use CrowdStrike, and recovery plans were individual. Though I was impressed with the global recovery, if you’re an IT pro or hang out with them (I do both) you know this was not easy.

Recovering from a one-shot, caught error is one thing. But it’s a reminder that we are very vulnerable and might want to question how a lot of infrastructure is set up. How many smaller-scale disasters do we not see because they weren’t big news? My general take is systems need to be easier to recover, more diverse, and honestly more walled off.

Also we need to stop depending on heroism in IT security. It should be incredibly boring.

The next CrowdStrike-type error should not happen. But right now my concern is what happens intentionally, what may happen on a smaller scale at first, and that we’re probably not ready for either.

CrowdStrike was a wake-up call to so many things wrong in modern infrastructure, so many things that could go wrong. As much as the company screwed up massively, there’s far more to worry about.

Steven Savage