CrowdStrike, Testing, and Security
A simple error took down countless systems — what are the implications?
A friend of mine texted me on Friday, after the recent CrowdStrike and Microsoft BSOD fiasco, and said, “It’s a great week to be a cybersecurity salesman…” A relative of mine, who isn’t technical like I am, asked me how things like this can keep happening. “Someone should have to pay for this kind of thing,” they said. That got me thinking about this particular incident, and I thought it might be helpful to explain the situation to people who aren’t part of the software industry.
The Costs of the CrowdStrike Error
To get a feel for the immediate financial impact of this mistake, look at CrowdStrike’s stock price. CrowdStrike was valued at $74B before this event, and its share price lost 11% of its value on Friday. Simple math puts the loss at $8.14B in stock value. That’s $8B in shareholder value, vaporized in a single day, by a mistaken software update. Eight. Billion. Dollars.
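The arithmetic is simple enough to sketch out. This uses the approximate figures reported at the time, not exact market data:

```python
# Back-of-the-envelope calculation of the single-day loss in market value.
# Figures are the approximate ones cited above, not precise market data.
market_cap_before = 74e9   # ~$74B valuation before the incident
drop_fraction = 0.11       # ~11% share-price decline on Friday

loss = market_cap_before * drop_fraction
print(f"${loss / 1e9:.2f}B lost in a single trading day")
```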
That’s just the impact on the shareholders. What about the long-term impact on CrowdStrike? In an industry where reputation and trust are the qualities that differentiate competitors, a mistake like this can create havoc. The business press is already beginning to forecast how competitors will use this to eat away at CrowdStrike’s customer base, although the impact may not be as great as some might think. By communicating effectively during this crisis, CrowdStrike may be able to greatly mitigate the damage. But let’s be clear: there will be a negative impact on their ability to sell in the future.
And those are just the costs to CrowdStrike. What about the costs to the overall economy? This event caused flight cancellations and delays, postponed medical procedures, and all sorts of other havoc. What is the cost of that? Will the people and organizations suffering those losses pursue legal action against CrowdStrike?
How Did This Happen?
The next scary thing to consider is exactly how this happened. What I have heard reported is that it was a software update that was automatically pushed out to customer systems. Is this REALLY what happened? I’m not going all “conspiracy theory guy” here; I would just like a little more transparency and a description of what actually happened.
There are a lot of malware and ransomware actors who must be watching this and thinking, “This thing had a larger impact than ANYTHING that I could have done”. If this was the result of someone delivering a bad software update, what is to stop a disgruntled employee from doing something like this again? What about an employee in financial difficulty providing a “bad actor” access to systems, for some sum of money?
I really want to know how a software change was able to make it into software that was remotely delivered to millions of customer machines, without having been tested, reviewed, approved, etc. When nuclear reactors melt down, when planes crash, when bridges collapse, we do a post-mortem and we share the results with the broader community. Was it a technical failure? Was it a procedural failure? Did poor design contribute to the accident? How do we prevent this from happening in the future? What can we collectively learn from this?
So far I have seen no evidence of any kind of “open” investigation into this. That worries me.
When Did I Stop Being an Engineer?
I was trained and educated as a “Software Engineer”. The phrase itself indicates a level of discipline and control: an “engineer” is expected to know their particular domain well enough to provide robust, optimal solutions to problems in it. An Electrical Engineer will provide solutions that are primarily electrical in nature, with calculations of current, voltage, and any safety concerns. A Structural Engineer will provide safe and sustainable structures, with calculated loads, weight-bearing capacity, and so on.
Software still seems to be a discipline that is “part art/part science”. We have some well known “rules of thumb”, or “best practices”, that most of us agree on. There are some well-known methodologies (like Agile, Kanban, Waterfall, and others) that are used to develop software. Most software development efforts have the concept of various testing phases, with testing focused on certain areas of risk (unit testing, component testing, system testing, penetration testing, security testing, etc.). We know that we need to build systems and avoid single points of failure. Continuous testing and other DevOps concepts also help support more disciplined software engineering efforts.
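For readers outside the industry, it may help to see what the smallest of those testing phases, “unit testing”, looks like in practice. This is a hypothetical illustration; the function and its test are invented for this post:

```python
def apply_discount(price: float, percent: float) -> float:
    """Return price reduced by percent; reject nonsensical inputs."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price * (1 - percent / 100)

def test_apply_discount():
    # A unit test pins down the expected behavior of one small unit,
    # including the failure case, so a bad change is caught before release.
    assert abs(apply_discount(100.0, 10) - 90.0) < 1e-9
    try:
        apply_discount(100.0, 150)
        assert False, "expected ValueError for an impossible discount"
    except ValueError:
        pass

test_apply_discount()
```

Continuous testing, in the DevOps sense, just means checks like this run automatically on every change, before anything ships.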
Yet most software developers will claim that there is an “art” to software development. How each of us frames and solves problems with software is different. The patterns and algorithms used to address problems are often common; there is ample study of these design patterns (like the “Gang of Four” design patterns) in software development. How those patterns get implemented is the “art” in software development.
I’m constantly embarrassed by the lack of “engineering” discipline on display in the market today. Software companies have an “oh well, this happens to everyone” attitude, or a “move fast and break things” philosophy, which is the exact opposite of software “engineering”. If a doctor, a lawyer, or a structural engineer had the types of problems that we encounter in software, and responded the way we seem to, they would be held legally responsible for their actions. Yet in the software world, we celebrate innovators who forge ahead without consideration for any of the collateral damage that their creations may cause.
How many software problems are the result of “bad actors”, and how many are the result of “bad engineers”? If you build a house with no locks on the doors, the theft of the owner’s belongings cannot be fully blamed on the “bad actors” who came along and robbed the place. The builder needs to accept responsibility.
To take this analogy further: I am fully aware that there are trade-off decisions to be made. While my house has some security features (locks, alarms, exterior lighting, etc.), it isn’t as secure as the vault at my local bank. My bank puts a different value on security; the risks are more profound. I am not advocating that every software solution be like a bank vault, but I am advocating that we start capturing, and being honest about, the levels of security (as well as other engineering concerns) that we build into our software creations.
When will software companies and software engineers begin to display some true engineering discipline?
Where Do We Go From Here?
Computing and software have not been around that long: less than a hundred years. The physics and chemistry that provide the foundation for many other engineering disciplines are much more mature and well established. So I understand that we have a longer path to travel. But we need to begin traveling it.
With the advent of AI technologies, which can be non-deterministic and require different testing and validation techniques, the need for true engineering discipline has never been more necessary. We need to be able to provide controls and a degree of certainty to our software creations — both those that are done with traditional software, and those that take advantage of AI techniques and capabilities.
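To make that contrast concrete, here is one hypothetical sketch of how validating a non-deterministic component differs from traditional testing: instead of asserting one exact output, you assert invariants and statistical bounds. The “model” here is a random stand-in invented for illustration, not any real AI system:

```python
import random

def model_score(text: str) -> float:
    """Stand-in for a non-deterministic model: returns a score in [0, 1]."""
    return min(1.0, max(0.0, 0.7 + random.gauss(0, 0.05)))

# A traditional exact-output test is impossible here, because two calls
# with the same input can return different values. Validate properties
# and aggregate behavior instead.
samples = [model_score("some input") for _ in range(1000)]

# Invariant: every output stays within its contractual range.
assert all(0.0 <= s <= 1.0 for s in samples)

# Statistical bound: the average stays near the expected operating point.
mean = sum(samples) / len(samples)
assert 0.6 < mean < 0.8, f"mean drifted out of bounds: {mean:.3f}"
```

The discipline is the same as in traditional testing — state what the system must guarantee, then check it mechanically — only the form of the check changes.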
So if you are creating these solutions, or having some “software engineers” create these solutions for you, make sure that you ask questions. Do we know the limitations of our product or solution? Do we have a way to monitor it for problems? Are we adequately thinking about and addressing things that will happen during the lifetime of this solution?
Don’t accept explanations loaded with complex tech vocabulary and acronyms. Good engineers should be able to explain these things to you in plain English. They should have documentation and monitoring for the various risk factors. They should be able to explain the design trade-offs that have been made. Demand true engineering discipline. Your company might depend on it.