Monday, October 28, 2019

A Pilot and Software Engineer's Take on the Boeing 737 Max


As of this writing, the ill-fated Boeing 737 Max series of jetliners is still grounded after two fatal crashes in which the pilots lost a battle with the plane's Maneuvering Characteristics Augmentation System (MCAS).  The U.S. Federal Aviation Administration (FAA) grounded the planes last March, and current estimates are that the planes will not be flying again until 2020 at the earliest.  This is a huge blow to Boeing and to the customers who bought the planes, as billions of dollars of assets are sitting idle on the runway instead of making money.

Only a month after the planes were grounded, a software engineer named Gregory Travis, who is also a pilot, wrote his thoughts on what happened with the Max 8 and why he thinks the problem may be intractable.  A version of his article appeared on the website of IEEE Spectrum recently, and to my mind it is the most comprehensive and damning examination yet of a situation that put thousands of lives at risk and ended up killing 346 people.

Travis points out that the 737 series was introduced all the way back in 1967.  Designing an airframe from the bottom up is a costly enterprise, so Boeing understandably would like to make incremental changes to an existing design rather than coming up with a whole new airplane every few years.  As fuel economy became more important for airlines, Boeing decided to go with more efficient engines, which for fundamental physical reasons have to be larger.  But eventually, the newer engines got so big that the ground clearance in their original positions was too small—the front fans were going to hit the ground if they didn't move the engines.  So they did move them upward and back.  But that caused another problem.

Travis drew on his experience as a pilot to note that you start playing with the fundamental handling characteristics of an aircraft when you move the engines around.  Stable flight is a complex interplay between the engine thrust vector and the center of gravity, the drag on the wings and other surfaces, and many other factors.  When the engines were moved, it made the plane tend to pitch upward with increased power, and this is not a good thing.  Upward pitch is to an airplane what tilting your head up is to your head. 

If an aircraft's angle of attack (the angle between the wing and the air moving past it) exceeds a critical value, the plane can stall, which basically makes it fall out of the air.  The modified 737 was edging dangerously close to a dynamically unstable condition, which is not something a commercial airliner should do.  Travis said that the right thing to do at this point would have been to redesign the whole airframe to deal with the changed position of the engines.  In his words, "The airframe, the hardware, should get it right the first time and not need a lot of added bells and whistles to fly predictably. This has been an aviation canon from the day the Wright brothers first flew at Kitty Hawk."

But instead of doing that, Boeing chose to develop a software patch that included the MCAS—a complicated system of interacting compensation fixes, pilot warnings, and poorly considered feedback loops that were vulnerable to faulty inputs from angle-of-attack sensors, which can easily be fooled by surface winds or other transient phenomena. 

Most modern airliners are "fly-by-wire" systems in which there is no direct mechanical connection between the pilot's stick and pedals, and the airplane's control surfaces.  Instead, a computer both takes in the pilot's commands and feeds back to the pilot something approximating the "feel" of manually operated controls, so that the pilot senses he or she is flying a plane and not a video game.  But the MCAS was apparently designed so that when it sensed a situation in which the nose needed to be pointed down, it would in effect grab the controls away from the pilot and do what it knew was right—even if it was wrong.  And the feedback motors that would do this were simply too powerful for the pilots to overcome.  In a reference to the famous HAL 9000 computer in the film 2001: A Space Odyssey, in which the computer tries to kill everyone on board for its own rather obscure purposes, Travis writes "MCAS gaslights the pilots. And it turns out badly for everyone. 'Raise the nose, HAL.' 'I’m sorry, Dave, I’m afraid I can’t do that.'"

We are well down the road that leads to 100% control of airplanes by robotic systems.  Nevertheless, we are far from arriving, and in the meantime there has to be effective and safe cooperation, not competition, between the human pilots and the software that runs the plane.  But in trying to cut corners by fixing an airframe problem with software, and poorly designed software at that, Boeing may have painted itself, and all its customers who bought 737 Max 8s, into a corner that it can't get out of.  Every month that goes by without an FAA-approved plan to fix or retrofit Max 8s so they can fly safely again is an indication that the problem revealed by the MCAS-related crashes may be deeper and more far-reaching than most people thought at first.  The fact that an engineer with deep expertise in both software and flying saw what was evidently going on within a month of the groundings tells me that he's probably on to something.

The historian of technology Henry Petroski says that engineers often learn more from failures than from successes.  We should learn a lot from the 737 saga, but it may prove to be an expensive lesson.  The 737 Max began commercial flights only in 2017, and I'm sure Boeing and its customers were counting on many years of revenue from their purchases.  If the design ends up being scrapped, it will amount to the largest recall in aviation history.  But if even just most of what Travis says is true, that is well within the realm of possibility.  Regardless of what patches Boeing may come up with, I'm never going to feel entirely comfortable flying in a 737 Max again. 

Sources:  Readers are urged to see Travis's complete article, which goes into greater depth than I have been able to here.  It is on the website of IEEE Spectrum at https://spectrum.ieee.org/aerospace/aviation/how-the-boeing-737-max-disaster-looks-to-a-software-developer.

Monday, October 21, 2019

Hard Rock Hotel Collapse: Why?


On Saturday morning, Oct. 12, a hotel under construction at the corner of Rampart and Canal streets in New Orleans, Louisiana, underwent a partial collapse, killing three workers and injuring 30.  The Hard Rock Hotel, originally planned as a mixed retail/residential project, had reached a height of 13 stories when something happened to cause a collapse at the top completed level.  A chain of floor collapses ensued, leading to a partial collapse of all the floors above about the seventh level.  The collapse also damaged the two tower cranes that were being used on the project, leading to concerns that they might fall and damage some of the surrounding structures in the densely populated downtown area.  At this writing (Wednesday, Oct. 16), the body of one worker has yet to be recovered.

Any time a construction accident occurs, the entire complex process of planning, management, and actual construction activity gets called into question.  The construction of a large high-rise such as the Hard Rock Hotel is an exercise in meticulous coordination and integration of technologies ranging from computer-aided design to the kind of pumps that can send many tons of concrete all the way up to the roof of a 13-story building.  With so much heavy stuff being supported in temporary ways, it's understandable that something could go wrong. 

For example, the concrete floors that are poured at each level have to set before they are put into compression by tensioning cables.  Try to tighten those cables too early, and you're liable to squash the still-weak concrete.  But wait too long by a day or so, and you've added costly time to the construction schedule.  A huge number of time-critical matters have to be coordinated within a small margin of error for things to go smoothly, and weather, supplier problems, and other external factors can throw a monkey wrench into the works. 

Still, most buildings go up without having multiple floors collapse on each other.  Viewed from the front, the structure looks like a giant finger just scraped all the floors above the seventh and bent them downward. 

A structural engineer named Walter Zehner once worked on the project in its early stages.  When contacted by a reporter from the Lafayette Daily Advertiser, he said that it was much too early even to speculate on the cause of the collapse.  After retrieval of the remaining fatality, engineers will have to stabilize the structure so that it won't present an ongoing hazard to surrounding buildings.  Only then will the investigation begin, and it might take months.

Construction was in progress at the time of the collapse, and Zehner says that the remaining eyewitnesses will be asked what exactly was being done at the time.  It's possible that someone accidentally knocked over a support column, for example.  If a heavy just-poured layer of concrete falls twelve or fifteen feet onto the floor below it, the impact could well cause the next floor to collapse, leading to just the kind of destruction that took place.  But all such notions are speculation at this point, and the investigation will reveal a sequence of events that may be traced backwards to a possible cause.
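To put rough numbers on why a falling slab is so much worse than a resting one, here is a minimal sketch in Python.  It is a textbook idealization and not an analysis of the Hard Rock structure: it treats the falling concrete as a rigid mass in free fall and the floor below as a linear-elastic support.  The 10 mm static deflection is purely an assumed figure for illustration, and real slabs crack and yield rather than behaving elastically.

```python
import math

G = 9.80665        # standard gravity, m/s^2
FT_TO_M = 0.3048   # feet to metres


def impact_speed(drop_height_m: float) -> float:
    """Speed of a mass in free fall after dropping drop_height_m (air resistance ignored)."""
    return math.sqrt(2.0 * G * drop_height_m)


def impact_factor(drop_height_m: float, static_deflection_m: float) -> float:
    """Peak dynamic load divided by static load for a mass dropped onto a
    linear-elastic support (classic energy-balance result)."""
    return 1.0 + math.sqrt(1.0 + 2.0 * drop_height_m / static_deflection_m)


if __name__ == "__main__":
    # ASSUMPTION: deflection of the lower floor under the same load applied slowly.
    assumed_deflection_m = 0.010
    for feet in (12, 15):
        h = feet * FT_TO_M
        print(f"{feet} ft drop: {impact_speed(h):.1f} m/s, "
              f"impact factor about {impact_factor(h, assumed_deflection_m):.0f}")
```

Even with a drop height of zero (a load applied suddenly), the factor in this idealized model is 2.  With a drop of twelve feet and the assumed deflection it is well over twenty, which is far beyond the safety margins floors are normally designed with.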

In the recent past there have been some indictments of city inspectors for taking bribes.  A lack of proper municipal oversight might lead to hazardous conditions that could cause such a collapse, but again, this is speculation. 

The most recent collapse of a structure under construction that was covered in this blog was the Florida International University pedestrian bridge collapse in Miami in 2018.  Six people were killed when a concrete-beam bridge collapsed just after being set in place.  The investigation of that accident is still ongoing, but late last year it was revealed that the National Transportation Safety Board (NTSB) had determined design errors were at least partly to blame. 

Accidents like the Hard Rock Hotel collapse can happen even if the plans are flawless.  The 1981 collapse of a pedestrian walkway inside the atrium of the Hyatt Regency Hotel in Kansas City was due not to any flaws in design, but to a compromise that the builder made in the support structure during construction.  Investigations may reveal that while the New Orleans hotel plans were correct, the builders may have overlooked something.  Or it could turn out that a single mistake made by one construction worker led to the tragedy. 

Not much is known about the extent of training that typical construction workers receive.  Construction is one of the few remaining fields in which a person without a high-school degree can earn at least in the range of $13 an hour, which is the average construction-worker wage in Louisiana according to a statistic cited by the website indeed.com.  This is scarcely anything to write home about, unless home is Guatemala, in which case it looks good compared to trying to be a subsistence farmer.  Nevertheless, it's attractive enough to draw workers who are willing to face the dangers and difficulties that construction work involves, up to and including the chance of dying in a tragic accident.

We will have to wait to find out what exactly happened in New Orleans to transform a nearly completed building into a pile of dangerous rubble.  And when we do, I hope that any lessons learned will be applied to future construction sites so that tragedies like this happen less and less frequently. 

Sources:  I referred to reports from the ABC News website at https://abcnews.go.com/US/search-underway-unaccounted-person-hard-rock-hotel-partially/story?id=66261708, the Lafayette Daily Advertiser website at https://www.theadvertiser.com/story/news/2019/10/14/hard-rock-hotel-collapse-new-orleans/3979427002/, and the Wikipedia website "List of structural failures and collapses."  The hourly construction wage statistic came from https://www.indeed.com/salaries/Construction-Worker-Salaries,-Louisiana.  I discussed the FIU bridge collapse at https://engineeringethicsblog.blogspot.com/2018/12/design-flaw-identified-in-fiu-bridge.html.

Monday, October 14, 2019

PG&E Pulls the Fire Plug


If you were one of an estimated two million customers of Pacific Gas & Electric in northern California this week, your power went off for a day or more.  There was no malfunction of the power grid.  Instead, the utility deliberately shut off power in large regions where high winds were predicted, in order to avoid sparking forest wildfires of the kind that have killed over a hundred people in recent years.  During the outage, the utility's website crashed, making it difficult or impossible for people to find out if they lived in an area targeted for an outage.  According to an article about the blackouts in the Wall Street Journal, California Governor Gavin Newsom reacted with "outrage," blaming the precautionary outages on PG&E's "greed and mismanagement over the course of decades."  PG&E CEO Bill Johnson said he might have some disagreements with Gov. Newsom, but that he was not ruling out similar outages in the future.

Reliable electric power is one of the mainstays of modern civilization.  Because most utilities outside large cities rely on above-ground transmission and distribution lines, their power grids are subject to what the lawyers call acts of God:  windstorms and ice storms that down power lines, lightning and floods that damage and destroy equipment, and other natural occurrences that disrupt the smooth delivery of power.  As long as these interruptions are rare and end promptly, no one blames the utility for them.  But the deliberate large-scale blackouts PG&E imposed simply as a precautionary measure are something new.

The Journal article points out that California now has a law making utilities liable for any damage caused by fires that are ignited by their lines, even if the utility was not negligent.  This law contributed to over $30 billion in potential liability costs associated with power-line-sparked wildfires and was a big reason why PG&E went into bankruptcy proceedings at the beginning of this year.  I don't know the history of that particular law, but it's consistent with a blame-the-powerful attitude that also seemed to inspire Gov. Newsom's comments.

Blaming the powerful is one thing, especially if they're guilty, but crippling a vital utility through excessively punitive laws is another thing.  The parties to this conflict include PG&E's management, workers, and investors, who mainly just want to do their job and/or get paid for it; PG&E's customers, who want reliable electric power without having their houses burn down; and the rest of California, which includes its government, along with the accompanying laws and regulatory environment.  Each group has interests that potentially conflict with the others, and these blackouts highlight the areas of conflict.

I lived in California for the four years of my undergraduate degree outside of Los Angeles in the 1970s, and I vividly recall waking up one day to see a dark cloud of smoke covering the entire northern half of the sky as a wildfire burned out of control in the San Gabriel Mountains.  Even back then, I thought people built houses in crazy places in California, on the edges of cliffs and so on, and it's only gotten worse since then.  Fires that used to damage nothing but wildlife (which is bad enough) now threaten whole communities, and so the need to control them by whatever means necessary has grown in recent years.

Part of that control is making sure that no tree can come anywhere close to a high-voltage transmission line.  PG&E has a tree-trimming program, but they admit they are behind in their scheduled trimming operations, and they also lack the ability to monitor winds at many specific locations so as to restrict the power outages to where they are really needed.  And even if they had such monitoring abilities, their older equipment doesn't allow them to be very selective in the power lines they de-energize—hence the massive blackouts covering a wide area. 

From here, the outages look like a desperate move by a utility company that is hamstrung by regulations and unfavorable laws.  If PG&E were a human being and not a large corporation, it would strike me as unfair to make him liable for damage even if negligence could not be proved.  If a driver is obeying all the traffic laws, and a child suddenly runs out from a hidden place and gets hit, the driver generally does not get penalized if there was nothing he could have done to avoid the accident.

But if you have an attitude that large private corporations are infinite money pots from which lawyers and their clients can extract indefinite amounts of money, sooner or later you run up against reality.  If PG&E doesn't have enough money or staff or freedom from regulations to cut away all their trees from their lines, and they risk the corporate equivalent of death if their lines cause a fire, then the precautionary blackouts look like the least bad alternative. 

Civilization is a huge mesh of cooperation:  buyers cooperating with sellers, consumers cooperating with producers, and government, one hopes, encouraging the virtuous kind of cooperation that leads to prosperous and flourishing societies.  But when groups begin to view other groups mainly as enemies and attribute malign motives to them, you can end up with a kind of self-fulfilling prophecy.  The maligned group or entity may think, "Well, these folks believe I'm a bad hombre no matter what I do, so I might as well act like it." 

The precautionary wind blackouts are a sign that maybe PG&E has been pushed too far.  We can hope that a spirit of conciliation can prevail so that more trees can be trimmed, more customers served more reliably, and fewer fires lead to tragedy in the future.  But right now, the prospects for that don't look too bright.  Especially if your power's out.

Sources:  The Wall Street Journal article "PG&E's Big Blackout is Only the Beginning" appeared on Oct. 12, 2019 at https://www.wsj.com/articles/pg-es-big-blackout-is-only-the-beginning-11570909816.  I also referred to a New York Times article at https://www.nytimes.com/2019/10/09/us/pge-shut-off-power-outage.html.

Monday, October 07, 2019

Pilot Overload and the Boeing 737 Max Accidents


In the last couple of months, new information about the factors leading to crashes of two Boeing 737 Max aircraft and the loss of 346 lives has emerged.  All such aircraft were grounded indefinitely last March after investigators found that a software glitch combined with faulty data from angle-of-attack sensors to start a chain of events that led to the crashes.  Airline companies around the world have lost millions as their 737 Max fleets sit idle, and Boeing has been under tremendous pressure from both international regulatory bodies and the market to come up with a comprehensive fix for the problem.  But as long as both humans and computers have to work together to fly planes, the humans will need training to deal with unusual situations that the computers come up with.  And in the case of the Lion Air and Ethiopian Airlines crashes, it looks like whatever training the pilots received left them inadequately prepared to deal with at least one situation that led to tragedies.

Modern fly-by-wire aircraft are certainly among the most complex mobile systems in existence today.  It is literally impossible for engineers to think of every conceivable combination of failures that pilots would have to handle in an emergency, simply because there are so many subsystems that can interact in almost countless ways.  But so far, airliner manufacturers have done a pretty good job of identifying the major failure conditions that would be life-threatening, and instructing pilots about how to deal with those.  The fact that Capt. Chesley Sullenberger was able to land a fly-by-wire Airbus A320 plane in the Hudson in 2009 after experiencing failure of all engines shows that humans and computers can work together cooperatively to deal with unusual failures.

But the ending was not so happy with the 737 Max flights, and recent news from regulators indicates that a wild combination of alarms, stick-shakings, and other distractions may well have paralyzed the pilots of the two planes that crashed after faulty readings from angle-of-attack sensors set off the alarms. 

Flying a modern jetliner is a little bit like what I am told it was like being in the army during World War II.  For many soldiers, the experience was a combination of long stretches of incredible tedium interrupted by short but terrifying bursts of combat.  It's psychologically hard for a person to remain alert and ready for any eventuality when the norm is that pretty much nothing out of the routine ever happens the vast majority of the time.  So when the unusual failure of an angle-of-attack sensor led to a burst of alarms and the flight computer's attempt to push the nose down, the pilots on the ill-fated flights apparently failed to cope with the confusion and could not sort through the distractions in order to do the correct thing.

A month after the Lion Air crash in 2018, the FAA issued an emergency order telling pilots what to do in this particular situation.  Read in retrospect, it resembles instructions on how to thread a needle in the middle of a tornado: 

            ". . . An analysis by Boeing found that the flight control computer, should it receive faulty readings from one of the angle-of-attack sensors, can cause 'repeated nose-down trim commands of the horizontal stabiliser'.  The aircraft might pitch down 'in increments lasting up to 10sec', says the order.  When that happens, the cockpit might erupt with warnings.  Those could include continuous control column shaking and low airspeed warnings – but only on one side of the aircraft, says the order.  The pilots might also receive alerts warning that the computer has detected conflicting airspeed, altitude and angle-of-attack readings. Also, the autopilot might disengage, the FAA says.  Meanwhile, pilots facing such circumstances might need to apply increasing force on the control column to overcome the nose-down trim. . . . They should disengage the autopilot and start controlling the aircraft's pitch using the control column and the 'main electric trim', the FAA say. Pilots should also flip the aircraft's stabiliser trim switches to 'cutout'. Failing that, pilots should attempt to arrest downward pitch by physically holding the stabilizer trim wheel, the FAA adds."

If I counted correctly, there are six separate actions a pilot is being told to take in the midst of a chaos of bells and whistles going off and his plane repeatedly trying to fly itself into the ground.  The very fact that the FAA issued such a warning with a straight face, so to speak, should have set off alarms of its own.  And after the second crash under similar circumstances, reason prevailed, but first with regulatory agencies outside the U.S.  Finally, the FAA complied with the growing global consensus and grounded the 737 Max planes until the problem could be cleared up.

When software is rigidly dependent on data from sensors that convey only a narrowly defined piece of information, and those sensors go bad, the computer behaves like the broomstick in the Disney version of Goethe's 1797 poem, "The Sorcerer's Apprentice."  It goes into an out-of-control panic, and apparently the pilots found it was humanly impossible to ignore the panicking computer's equivalent of "YAAAAH!" and do the six or however many right things that were required to remedy the situation. 
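One common guard against software that trusts a single narrow input is to cross-check redundant sensors and decline to act when they disagree.  The sketch below, in Python, illustrates that general idea only.  It is not Boeing's code or its eventual fix; the function names, the 5-degree disagreement limit, and the 14-degree activation threshold are all hypothetical values chosen for illustration.

```python
from typing import Optional


def trusted_aoa(left_deg: float, right_deg: float,
                max_disagreement_deg: float = 5.0) -> Optional[float]:
    """Return an angle-of-attack estimate only if both sensors roughly agree.

    ASSUMPTION: two independent sensors and a hypothetical 5-degree limit.
    Returns None when the readings disagree, meaning 'do not trust either'.
    """
    if abs(left_deg - right_deg) > max_disagreement_deg:
        return None
    return (left_deg + right_deg) / 2.0


def nose_down_trim_allowed(left_deg: float, right_deg: float,
                           threshold_deg: float = 14.0) -> bool:
    """Permit automatic nose-down trim only on a trusted, high reading.

    ASSUMPTION: the 14-degree threshold is illustrative, not a real limit.
    """
    aoa = trusted_aoa(left_deg, right_deg)
    return aoa is not None and aoa > threshold_deg


if __name__ == "__main__":
    # One vane reads absurdly high while the other looks normal:
    # the automation stands down and leaves the decision to the pilots.
    print(nose_down_trim_allowed(22.0, 4.0))
    # Both vanes agree the angle is high: automatic trim is permitted.
    print(nose_down_trim_allowed(16.0, 15.0))
```

The design choice worth noticing is the failure direction: when the inputs cannot be trusted, this sketch does nothing rather than something, which keeps a single bad sensor from turning into a fight over the controls.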

It is here that an important difference between even the most advanced artificial-intelligence (AI) system and human beings comes to the fore.  It is the ability of a human being to maintain a global awareness of a situation, flexibly enlarging or narrowing the scope of attention as required.  Clearly, the software designers felt that once they had delivered an emergency message to the pilot, the situation was no longer their responsibility.  But insufficient attention was paid to the fact that in the bedlam of alarms that the sensor failure caused, some pilots, even though they were well trained by the prevailing standards, simply could not remember the complicated sequence of fixes required to keep their planes in the air.

Early indications are that the 737 Max "fix," whatever software changes it involves, will also involve extensive pilot retraining.  We can only hope that the lessons learned from the fatal crashes have been applied, and that whenever such unusual sensor failures happen in the future, pilots will not have to perform superhuman feats of concentration to keep the plane from crashing itself.

Sources:  A news item about how Canadian regulators are looking at the pilot-overload problem appeared on the Global News Canada website on Oct. 5, 2019 at https://globalnews.ca/news/5995217/boeing-737-max-startle-factor/.  The November 2018 FAA directive to 737 Max pilots is summarized at https://www.flightglobal.com/news/articles/faa-order-tells-how-737-pilots-should-arrest-runawa-453443/.  I also referred to Wikipedia's articles on the Boeing 737 Max groundings, Chesley Sullenberger, and The Sorcerer's Apprentice.