Global IT Outage involving Windows linked to CrowdStrike

It appears that overnight (starting Thursday night, 2024-July-18, and continuing into Friday morning) some sort of issue has been causing some Windows PCs globally to hit the “Blue Screen of Death”.

If you experience this issue and are a CrowdStrike customer, there are mitigation instructions in this article:

Other coverage:

https://www.washingtonpost.com/technology/2024/07/19/microsoft-windows-outage-blue-screen-bsod/

Dodged a bullet there; we were in talks with CrowdStrike earlier this year, but went with a different provider in the end…

Apparently, as a different mitigation, there is a registry change to block CrowdStrike from starting:

HKLM:\System\CurrentControlSet\Services\CSAgent\Start
Change from a 1 to a 4 to disable CrowdStrike from starting.
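If you would rather script it than edit the registry by hand, here is a minimal PowerShell sketch of the same change (assuming you can reach an elevated prompt, e.g. in Safe Mode; a Start value of 4 means the service is disabled):

# Sketch only: set the CSAgent service Start value from 1 to 4 (disabled)
Set-ItemProperty -Path 'HKLM:\SYSTEM\CurrentControlSet\Services\CSAgent' -Name 'Start' -Value 4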

1 Like

Subreddit megathread here

BSOD error in latest crowdstrike update : r/crowdstrike (reddit.com)

UK NHS primary care is pretty much down as a result of this. No access to medical records; I’ve just been to see my doctor, and they can’t prescribe or refer until they get their servers back up. I assume it has also impacted treatment, operations and emergency care too.

Yes, we were affected around 2:30pm. MS Teams started exploding with people reporting BSOD and boot loops. At that stage, my laptop was still okay. About an hour later, my laptop also succumbed to the boot loops.

I eventually got it back by retrieving the BitLocker recovery key from my corporate Microsoft account, booting the laptop into safe mode, and deleting a .sys file in the CrowdStrike folder. After that, the laptop just worked.
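In case it helps anyone else, a minimal sketch of that deletion step from an elevated PowerShell prompt in Safe Mode (the C-00000291*.sys filename pattern and the drivers\CrowdStrike path are the widely circulated workaround details, not something quoted from this post):

# Sketch only: remove the bad CrowdStrike channel file(s), then reboot normally
Remove-Item 'C:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys'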

I feel Monday is going to be a very busy day for the IT infrastructure team, manually fixing hundreds, if not thousands, of laptops. I was fortunate that, as a developer, I had admin access to my laptop to make the necessary file deletion.

My bank is still down - no payments are going in or out apart from card payments. Many major stores’ POS systems are down - I didn’t venture out to the shops today. People expecting to be paid today will be very disappointed.

Thankfully, the emergency services don’t seem to be affected at all; some public transport has been affected in other states, but it seems to have totally bypassed South Australia. A lot of airports are affected.

And on an unrelated issue, Microsoft has been having problems with Azure and Microsoft 365.


1 Like

Steve Gibson isn’t going to be short of material for the next Security Now.

Meanwhile I am reminded that back in the 1980s there used to be a collection of snarky sayings called The Devil’s DP Dictionary, which included this little gem:
“One-line change, n: a programming change so small that it requires no testing before crashing the live system.”

4 Likes

Disclaimer: I work for a VAR that resells CrowdStrike and uses it internally (fortunately, only 1 server and 3 laptops had an issue).
It’s hard to believe they have so much market share that an “oops” could pretty much take the entire world down.

1 Like

His reply to my post about it on his newsgroup gave me a chuckle:

From: "Steve Gibson" <news008_@_grc.com>
Subject: Re: Global IT outage appears blamed on MS
Date: Fri, 19 Jul 2024 07:46:08 -0700
Message-ID: <v7du7d$pi5$1@GRC>
Lines: 13

[for the unabridged version, see Paul Holder's post above]

This is unacceptable! I was unable to use my mobile app to order 
my coffee ahead and have it waiting for me! It was like what the 
Pilgrims had to put up with.

<g>

Interesting times indeed!
2 Likes

My bank is still not fully recovered. Fast transfers aren’t working. We have to wait a day to get our money like in prehistoric times.

They don’t have that much market share, but they do have a lot of very important customers. They are a favourite of the big cloud providers, like Google, Amazon and Microsoft.

That said, I haven’t heard of any of our suppliers or customers being affected…

They are very expensive, so it is mainly larger companies that can afford them, and those larger companies tend to have larger, more globally widespread customer bases, so when something happens, it affects a lot of people.

This has happened a few times before, with other big AV companies over the years. If it had been Sophos, McAfee, Kaspersky etc. it would have had similar results.

2 Likes

I doubt my bank is affected, they were still using OS/2 up until 2010-2012… :rofl:

1 Like

Still no access to UK NHS medical records after 24 hours, so it must be a bigger issue than fixing some boot-looping Wintel boxes. That’s the problem with this sort of outage: it can get complicated, especially if you have critical data with lots of pending updates queued up all over the system.

1 Like

It could be that the system relies on a really large number of Windows boxes (like one per office/surgery/hospital ward/pharmacy) and most of those locations don’t have anyone with the technical knowledge to follow instructions about putting the PC into recovery mode and using the command line to locate and delete one specific file.

Most people who use PCs know how to turn them off and on again if anything goes wrong, and to be fair that fixes an awful lot of problems.

Then, if you have a technically knowledgeable staffing level based on the *assumption that most problems can be fixed by remoting in and that issues requiring someone to be physically present are really rare, you might have 5, 10, 15? tech staff driving all over the country to apply the fix to each box.

Microsoft was talking about “turning it off then on up to 15 times” but I think that was for a few edge cases. From what I’ve read, there’s mostly no alternative to being hands on keyboard on-site to fix them.

I also get the impression that there may be quite a lot of systems provided by small third-party suppliers involved in moving data around in the NHS (I may be wrong). If that’s the case, they will be based on the model of supplying locations with a standard desktop PC ready to run with specialist software, and a small tech team for fixes (see assumptions). If you have a lot of small suppliers, that could be a lot of mileage being clocked up in uncoordinated fix visits.

*Assumptions. Beloved of bean-counters, they eventually come back to bite you.

2 Likes

That’s probably what’s happened. I can imagine the response from a cut-back IT dept, or more likely a 3rd party, when it becomes clear they have to physically visit tens (hundreds?) of thousands of devices to deploy a fix manually.

2 Likes

“When in danger or in doubt,
Run in circles, scream and shout”.

1 Like

Saw there was a doctor on the BBC earlier today saying that her practice was badly hit, but she also knew of ones that were unaffected. Does sound like it could be lots of systems from different suppliers. Which is normally a recipe for resilience, until they all start relying on the same corporate AV supplier on the same OS.

2 Likes

I keep reading about which hospitals had issues and which ones didn’t. That just tells me who uses Crowdstrike. We also sell one of their competitors. I’m waiting for them to come out with a Crowdstrike replacement promo…

1 Like

Just like the Kaspersky replacement promo 2 years back, which will probably be doing the rounds again in the USA at the moment.

1 Like

I can tell you that if you got hit hard by this, the remediation is painful. If you are on-prem, it’s not so bad, because you can easily boot your server into safe mode to do the recovery. But if you are in the cloud (Azure, AWS, etc.), this is 10x harder. In Azure, you either need to build a recovery VM to do this, or mount the disk on a different VM. It took me 90 minutes to get one server recovered via this process (of course, that was after repeatedly rebooting to see if it would fix itself). People published automated ways to do this, but if the machine doesn’t boot up far enough to pull down the updated scripts etc., that won’t work.
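To give a feel for the mount-the-disk route, here is a minimal sketch, assuming the broken VM’s OS disk (or a copy of it) has been attached to a healthy recovery VM and shows up there as drive F: (the F: drive letter and the C-00000291*.sys channel-file pattern are assumptions for illustration, not details from this post):

# Sketch only: on the recovery VM, delete the bad channel file(s) from the attached OS disk,
# then detach the disk and swap it back onto the original VM so it can boot again
Remove-Item 'F:\Windows\System32\drivers\CrowdStrike\C-00000291*.sys'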

My company works with a lot of healthcare customers, so I’ve seen many EMR systems. Most are 20-30 servers, which is on the small side. The bigger you are, the bigger the footprint. Plus, the hospitals won’t turn them back on until they’ve validated everything. As to why DR wasn’t an option: most systems are replicated to the DR site, so the corrupted system would be what is sitting there.

I stand by CrowdStrike as a product. I’ve seen it prevent attacks that could have been much worse. This is why all these companies use the product. But this “oops” will be a huge black mark for them.

2 Likes

This was a rough one. Worked about 20 hours since Friday AM, when our machines started going down. Haven’t worked an outage like this since probably around 2015, and while it’s not something I miss, it was nostalgic being in the trenches with my fellow sysadmins.

Gonna go play in the woods now to detox for a bit.

3 Likes