18

CrowdStrike pushed a bad content file and caused a global outage. That's well covered in many other places, so I won't elaborate on it.

What I'm interested in is whether company policies to delay automatic updates would have contained the outage.

CrowdStrike lets you set policies that limit the Falcon sensor version to n-1, n-2, or even a specific build. So you can assign a test group to get the most recent sensor version and your production groups to get a slightly earlier one.
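To make the n / n-1 / n-2 idea concrete, here is a minimal sketch of how such a policy could resolve a sensor version per host group. It is purely illustrative: `UpdatePolicy`, `resolve_sensor_version`, and the version strings are hypothetical names I made up, not anything from the Falcon console or API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical release history, oldest to newest. Not real Falcon build numbers.
SENSOR_RELEASES = ["7.13", "7.14", "7.15", "7.16"]


@dataclass
class UpdatePolicy:
    """Illustrative model of a per-group sensor update policy."""
    group: str
    offset: int = 0                     # 0 = latest (n), 1 = n-1, 2 = n-2
    pinned_build: Optional[str] = None  # a specific build overrides the offset


def resolve_sensor_version(policy: UpdatePolicy,
                           releases: List[str] = SENSOR_RELEASES) -> str:
    """Return the sensor version a group would receive under this policy."""
    if policy.pinned_build is not None:
        if policy.pinned_build not in releases:
            raise ValueError(f"unknown build: {policy.pinned_build}")
        return policy.pinned_build
    if not 0 <= policy.offset < len(releases):
        raise ValueError(f"offset out of range: {policy.offset}")
    # Count back from the newest release.
    return releases[-1 - policy.offset]


test_group = UpdatePolicy(group="test")             # gets n
production = UpdatePolicy(group="prod", offset=1)   # gets n-1
print(resolve_sensor_version(test_group), resolve_sensor_version(production))
```

Note that this models sensor versions only; it says nothing about whether content files follow the same policy.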

This is a common practice used for other types of patching, such as for Windows updates. I understand that the pace is necessarily faster for information security products, but perhaps some form of vetting might have been possible.

The file that caused the problem is classified as a "content file", so it's possible that sensor update policies wouldn't have blocked it.

On the other hand, Dave Plummer's YouTube video suggested that CrowdStrike was using content updates to patch the sensor code without having to go through Microsoft's driver approval process every time. And the sensor version numbers do appear to increase fairly rapidly. So it's also possible that the policies control content updates as well.

So, can we say that if a CrowdStrike customer had set up a procedure to test machines against sensor updates before approving them for general release within their company, the outage could have been contained to the test group(s)?

Spencer
  • 378

2 Answers

18

I spoke with their support: Falcon sensor version policies do not delay content updates, so those on n-1 were still impacted.

Jacob Evans
  • 8,431
16

Probably not, because of who the application owners typically are and because of the update philosophy of CrowdStrike and of security products in general.

These products usually self-update frequently, with almost no formal communication. This is (subjectively) different from a generic application major or minor binary update.

The vendors and their counterparts at organizations aren't disciplined ITIL practitioners and never will be ...

About seven years ago I was on a similar incident with Trend Micro. They released an update that cratered every Windows cluster we had because it interfered with lookups for the Cluster Name Object. We were on a conference call with 70 people.

When Microsoft called in, they told us right away that several other customers with the same product had the same problem...

Fortunately not as many organizations had that product.


This is not a traditional product where the customer plans and prepares for updates by mitigating risk and identifying contingencies. This product is co-managed by the vendor, and they are taking actions that the customer is unaware of.

In a sense, this is more of a service than a product. Concerns about implementations and management should be reinforced by contract, or by selecting a different solution or service.


It's worth noting that the offending update was a file intended to gather routine information for analysis (basic threat telemetry). Also, the component that failed was far back in the release and distribution process, and was in fact the component that was supposed to validate that the file contained the correct information. So the validation mechanism designed specifically to prevent this failure itself failed, in exactly the wrong way.

https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/


Information that surfaced in the weeks since the incident has resulted in modest changes to how updates are deployed, including the capability to actually manage the process.

However, a more interesting reveal is that the application was not, and in its current state still is not, capable of identifying and rolling back this failed update, and likely many other types of failures. The vendor is engaging multiple firms to review the various mechanics of the endpoint architecture and update process.

It's probably worth noting that, compared to traditional security vendors, the endpoint is only a fraction of their total technology estate. This is due to the amount of infrastructure they own in AWS (both for customer data streams and an extensive API infrastructure) and the data that they mine from it.

https://www.darkreading.com/cyber-risk/crowdstrike-will-give-customers-control-over-falcon-sensor-content-updates

Greg Askew
  • 39,132