An old friend of mine summarized engineering ethics for me once in two words: "No headlines." Meaning, I suppose, that if an engineering firm does its job right, there is no reason for it to show up in news headlines, which tend to focus on bad news.
Well, the cybersecurity firm CrowdStrike, based just up the road from me in Austin, Texas, managed to break that rule spectacularly last Friday, July 19, when they issued what was supposed to be a routine "sensor configuration update."
CrowdStrike makes cloud-based software that helps prevent cyberattacks and other security breaches, and one part of doing that involves sensing attacks. Because the nature of cyberattacks changes daily, security software firms such as CrowdStrike have to update their software constantly, and that includes updating the sensor parts too. It's not clear to me whether the software gets installed by IT departments or individuals, but I would suspect the former. The product involved in last Friday's update is called Falcon, and the faulty update affected only Microsoft Windows machines, of which there are about 1.4 billion in the world.
Something was radically wrong with the update sent out near 11 PM Austin time, because on about 8.5 million PCs, a logic error in the update caused them to freeze up and exhibit the famed "blue screen of death" (BSOD). One way I tell my students that they can assess the relative importance of a given technology is to imagine that an evil genie waves a magic wand at midnight, and suddenly all examples of that technology throughout the world vanish. How big would the disruption be?
Well, something like that happened Friday, and the disruptions made a ton of headlines. Most major U.S. airlines suddenly found themselves without a scheduling or ticketing system. Schools and hospitals across the U.S. were deprived of their computer systems. The 911 emergency-call systems in some cities crashed.
CrowdStrike CEO and co-founder George Kurtz issued an apology Saturday, saying in a blog post, "I want to sincerely apologize directly to all of you for today's outage." Once their engineers discovered what was going on, CrowdStrike rushed to provide a fix, which involved rebooting the paralyzed PCs into safe mode, deleting a certain file, and rebooting. But multiply that fairly simple task, which could be done easily by an IT tech and with difficulty by anyone else, by 8.5 million PCs, and it was clear that this mess wasn't going to be cleaned up overnight.
As computer foulups go, this one was fairly minor, unless you were trying to get somewhere by plane over the weekend. I don't know for sure, but it's possible it could have been avoided if CrowdStrike had a policy of trying out each of their updates on a garden-variety PC to make sure it works. Maybe they did, and there's some subtle difference between their test bed and the 8.5 million PCs that froze up. That's for them to figure out, assuming that they weather the storm of lawsuits that may arise on the horizon once the accountants of affected organizations figure out how much revenue was lost in the flight delays, scheduling problems, and other issues caused by the glitch.
The crowning irony of the whole thing was, of course, that the problem was caused by software that was designed to prevent problems. This isn't the first time that safety equipment turned out to be dangerous. In the auto industry, a years-long slow-motion tragedy was caused by the carelessness of Takata, a manufacturer of airbag inflators, which sold inflators with a defect that caused them to detonate and send flying metal shrapnel into the car's passengers instead of just inflating the airbag. After years of recalls, Takata declared bankruptcy in 2017 and is out of business.
One hopes that this single screwup will not spell doom for a cybersecurity company that up to now seems to have been doing a good job of preventing computer breaches and otherwise keeping out of trouble. It's a public corporation with about 8,000 employees, so it's unlikely that giant firms such as American Airlines could recoup their losses without just bankrupting the whole outfit. If Microsoft itself were directly responsible, that would be another question, but Microsoft's only involvement was the fact that the faulty update affected only Windows machines.
This whole episode can serve as a cautionary experience to help us prepare for something bigger that might come down the technology pike in the future. Malicious actors are constantly trying to exploit vulnerabilities for various nefarious purposes, ranging from vandalistic amusement all the way up to strategic military incursions mounted against multiple countries. It would be worthwhile to imagine the worst that could happen computer-wise, and then at least ask the question, "What would we do about it?"
My sister works at a large hospital where they have toyed with the idea of deliberately turning off all their computers once every so often, and trying to keep their operations going with paper and phones. They've never mustered the nerve to do it, partly because there are some things that would be flat impossible to do, and the reduction in service capabilities would be a disservice to the public they have committed to serve.
But for organizations that could manage it, it would be a worthwhile exercise to see if doing without computers for a set time is possible at all, and what would have to change to make it possible if it isn't presently.
In researching this article, I discovered that of those 1.4 billion PCs running Windows out there, about 1 billion of them are still running Windows 10, which is set to lose official support in 2025. I happen to own one of those legacy Windows 10 machines that can't be upgraded to Windows 11 because of some newfangled Windows 11 hardware requirement. So we can expect another disruption around October of 2025 when Windows 10 support ends. Let's just hope it isn't as sudden and startling as the CrowdStrike blue screens of death.
Sources: I consulted the articles "Huge Microsoft Outage Caused by CrowdStrike Takes Down Computers Around the World" at https://www.wired.com/story/microsoft-windows-outage-crowdstrike-global-it-probems/ and "CrowdStrike discloses new technical details behind outage" at https://www.scmagazine.com/news/crowdstrike-discloses-new-technical-details-behind-outage, the article at https://www.zdnet.com/article/is-windows-10-too-popular-for-its-own-good/ for the statistic about Windows computers, and the Wikipedia article on CrowdStrike.