In July 2024, a faulty update to the CrowdStrike Falcon security agent caused a global outage across a large number of Windows systems. The root cause was traced to a logic error in CrowdStrike’s code that led to an invalid memory access, crashing affected machines with the infamous Blue Screen of Death (BSOD).
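Memory is addressed as a large array of bytes, and addresses are usually written in hexadecimal. In this case, the crash occurred when the driver tried to read memory address 0x9c (decimal 156). Addresses this close to zero are normally left unmapped, so no program can read from them. Early public analyses attributed the access to a null pointer dereference, a common programming pitfall in which code reads a field at a small offset from a pointer that is null (zero) without checking it first. CrowdStrike’s later root cause analysis described the fault as an out-of-bounds memory read.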
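To make the arithmetic concrete, here is a minimal Python sketch of how a null base pointer plus a field offset produces an address such as 0x9c. The `SensorRecord` layout, the `PAGE_SIZE` guard, and both helper functions are hypothetical and chosen only for illustration; they are not taken from CrowdStrike's code.

```python
import ctypes

# Assumption: addresses below one 4 KiB page are treated as "near null".
PAGE_SIZE = 0x1000


class SensorRecord(ctypes.Structure):
    """Hypothetical record layout, chosen so one field sits at offset 0x9c."""
    _fields_ = [
        ("header", ctypes.c_uint8 * 0x9C),  # 156 bytes standing in for earlier fields
        ("value", ctypes.c_uint32),         # lands at offset 0x9c (156)
    ]


def field_address(base: int, field) -> int:
    """Address that would be touched when reading `field` through pointer `base`."""
    return base + field.offset


def safe_read_value(base: int):
    """Read `value` only if the base pointer is not null or near null."""
    if base < PAGE_SIZE:
        return None  # this is the check a null pointer dereference is missing
    return SensorRecord.from_address(base).value


# A null base pointer (0) plus the field offset gives the faulting address.
print(hex(field_address(0, SensorRecord.value)))  # 0x9c

# With a valid record, the same read succeeds.
record = SensorRecord()
record.value = 42
print(safe_read_value(ctypes.addressof(record)))  # 42
print(safe_read_value(0))                         # None
```

In user-mode code a missed check like this usually kills only the offending process. In a kernel driver the same invalid read takes down the whole operating system.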
Lack of Static Code Analysis and Code Reviews: The absence of thorough static code analysis and rigorous code reviews allowed the logic error to go unnoticed before the update was released.
Insufficient Integration Testing: There was no comprehensive, continuous integration testing to catch conflicts and errors across existing and new versions of operating systems.
Overprivileged Vendor Processes: The vendor’s code ran as a kernel-mode driver with unrestricted access to the operating system on mission-critical workloads. A fault in kernel mode cannot be contained the way a user-mode crash can, so when the driver failed, Windows halted the entire system immediately as a precaution.
Uncontrolled Automated Updates: Customers had no control over the vendor’s automated updates, so they could not test breaking changes before those changes reached production systems.
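One common way to regain control over automated updates is a staged rollout, where an update must prove healthy on a small group of machines before it is promoted to the next group. The following Python sketch shows the idea under stated assumptions: the ring names, the `RingResult` type, and the 1% failure threshold are all hypothetical and do not describe any vendor's actual release process.

```python
from dataclasses import dataclass


@dataclass
class RingResult:
    name: str      # hypothetical ring label, e.g. "lab" or "canary"
    hosts: int     # hosts in this ring that received the update
    crashed: int   # hosts that crashed or failed health checks afterwards


def promote_update(results, max_failure_rate=0.01):
    """Return the names of the rings the update passed, in order.

    Rings are evaluated in rollout order; promotion stops at the first
    ring that has no data or exceeds the allowed failure rate.
    """
    approved = []
    for ring in results:
        if ring.hosts == 0:
            break  # no evidence from this ring, so do not promote further
        if ring.crashed / ring.hosts > max_failure_rate:
            break  # halt the rollout before it reaches later rings
        approved.append(ring.name)
    return approved


# A crashing update is stopped at the canary ring and never reaches production.
observed = [RingResult("lab", 20, 0), RingResult("canary", 200, 37)]
print(promote_update(observed))  # ['lab']
```

The design choice here is that the default answer is "stop": an update advances only on positive evidence from the previous ring.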
Identifying and fixing such issues in vendor security products is a deeply complex and expensive task for customers:
Complexity of Vendor Software: Vendor security products often have complex codebases that require significant effort and expertise to audit and maintain.
High Costs: Engaging in continuous code reviews, static analysis, and extensive testing requires substantial financial investment.
Operational Risks: Running privileged vendor processes or agents in mission-critical workloads introduces severe operational risks that can outweigh the intended security benefits. Any flaw in these processes can lead to catastrophic system failures, as this incident showed.
Agent-based security solutions, such as CrowdStrike’s, rely on software agents installed on endpoints to monitor and protect systems. While effective, these agents can introduce significant risks:
Compatibility Issues: Software updates can lead to conflicts with the operating system or other installed software.
Performance Overheads: Agents consume system resources, potentially slowing down the system.
Maintenance and Updates: Regular updates are necessary, which can sometimes lead to critical failures, as seen in this incident.
Agentless security testing offers a robust alternative to traditional agent-based methods for immutable modern workloads. Here are key benefits:
No System Interference: Since no agent runs on the endpoint, there is nothing to conflict with the operating system or other installed software, and nothing to consume the endpoint’s resources.
Simplified Management: Without agents, there’s no need for constant updates and maintenance, reducing administrative overhead.
Enhanced Security: Agentless solutions minimize the attack surface by eliminating additional software that could potentially be exploited.
Prancer provides an innovative approach to security testing through its agentless and policy-as-code strategies. Here’s how Prancer addresses the key issues highlighted:
Policy as Code: Prancer utilizes policy as code to define and enforce security policies across your infrastructure. This ensures consistent and repeatable security practices without the need for intrusive agents.
Automated Static and Dynamic Analysis: Prancer’s platform includes automated static and dynamic analysis tools that can identify potential issues, such as null pointer dereferences, before they reach production.
Comprehensive Testing: Prancer’s continuous integration testing validates security controls against both existing and new versions of operating systems, ensuring compatibility and stability.
Controlled Updates: With Prancer, customers can test and validate updates in a controlled environment before deployment, avoiding the risks associated with uncontrolled automated updates.
Reduced Operational Risks: By avoiding overprivileged agent processes, Prancer significantly reduces the operational risks associated with traditional agent-based security solutions.
Cost Efficiency: Prancer’s automated and agentless approach reduces the need for extensive manual code reviews and testing, leading to cost savings.
Transparency and Control: Customers have complete visibility and control over their security policies and updates, enhancing overall security posture.
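As a rough illustration of the policy-as-code idea in general, the Python sketch below expresses security rules as data plus small check functions and evaluates them against a configuration snapshot collected without any agent on the workload. This is not Prancer's policy syntax or API. The `Policy` type, the policy IDs, and the resource fields (`type`, `public_access`, `disk_encrypted`) are hypothetical names invented for this example.

```python
from typing import Callable, NamedTuple


class Policy(NamedTuple):
    policy_id: str                 # hypothetical identifier
    description: str
    check: Callable[[dict], bool]  # returns True when the resource complies


# Hypothetical policies over a hypothetical resource schema.
POLICIES = [
    Policy(
        "STORAGE-001",
        "Storage must not allow public access",
        lambda r: r.get("type") != "storage" or not r.get("public_access", False),
    ),
    Policy(
        "VM-001",
        "Virtual machines must have disk encryption enabled",
        lambda r: r.get("type") != "vm" or r.get("disk_encrypted", False),
    ),
]


def evaluate(snapshot, policies=POLICIES):
    """Return (resource name, policy id) pairs for every violation found."""
    return [
        (resource["name"], policy.policy_id)
        for resource in snapshot
        for policy in policies
        if not policy.check(resource)
    ]


# The snapshot stands in for configuration data read through a provider's API.
snapshot = [
    {"name": "logs", "type": "storage", "public_access": True},
    {"name": "web-1", "type": "vm", "disk_encrypted": True},
    {"name": "db-1", "type": "vm"},  # encryption setting missing, treated as off
]
print(evaluate(snapshot))  # [('logs', 'STORAGE-001'), ('db-1', 'VM-001')]
```

Because the policies are ordinary code, they can be versioned, reviewed, and run in a CI pipeline like any other change, and nothing has to be installed on the workload being checked.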
The recent CrowdStrike-induced Windows outage highlights the potential risks associated with agent-based security solutions. Moving towards an agentless security testing strategy, as exemplified by Prancer, can enhance system reliability, simplify management, and improve overall security posture. By adopting modern programming practices and leveraging advanced security testing methodologies, organizations can mitigate the risks of such disruptive incidents in the future.
While endpoint agents are still relevant for end-user computing, modern workloads should consider an agentless approach to ensure robust and resilient cybersecurity in an increasingly complex digital landscape.