Monday, March 18, 2019

Are the 737 Max 8 Crashes Single-Point Failures?


A friend of mine who formerly worked at NASA was talking about his volunteer work at his church, which is to operate the video camera that records the pastor's sermon.  He's going to ask them to buy a second camera, and when I asked him why, he said, "Single-point failure.  That camera goes out, we're up a creek without a paddle." 

The same concept that applies to a situation as humble and non-life-threatening as recording sermons also applies to highly complex systems such as Boeing's 737 Max 8, a new version of the popular 737 aircraft that is in service around the world.  New evidence from the Mar. 10, 2019 crash of a 737 Max 8 outside Addis Ababa, Ethiopia, in which 157 people died, indicates that a single-point failure may be responsible for both that disaster and a similar crash of another 737 Max 8 on Oct. 28, 2018 in Indonesia that killed 189.

The single-point failure possibility involves a new anti-stall system called MCAS (Maneuvering Characteristics Augmentation System), which Boeing installed on the Max 8 version of its 737s when the two engines were moved forward compared to earlier versions.  Because this move made the aircraft more prone to stall, MCAS was intended to make the plane handle more like older 737s, reducing the need for extensive pilot retraining.  But evidently, pilots were not thoroughly informed that MCAS was in place and activated until the crashes brought attention to the system.

The system works by monitoring information from two sensors called angle-of-attack (AOA) sensors.  These are small fins that stick out from the side of the aircraft rather like wind vanes, and rotate to sense the direction of local airflow with respect to the fuselage.  In a stall, the plane is tilted nose-up excessively with respect to the direction of airflow.  This makes the sensor turn at an angle that the onboard computers use to figure out that it's time to take over the controls from the pilot and push the nose down.

Normally, according to a post on aviation.stackexchange.com, the onboard computer takes the output of both AOA sensors into account, and if one indicates a stall and the other doesn't, perhaps just a warning is issued to the pilot.  But according to a New York Times report, the MCAS anti-stall system activates even if only one of the two sensors says the nose is too high.  If anything happens to make one of the sensors give a false reading (a stray updraft from the backwash of a flight that just took off, for example), the MCAS goes into action and pushes the nose down, even if the takeoff is proceeding normally.
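To make the difference between those two behaviors concrete, here is a toy sketch in Python.  It is not Boeing's logic or anything taken from the aircraft's software; the function names, the stall threshold, and the allowed disagreement between sensors are all made-up values chosen only for illustration.  The first function acts when either sensor reads high, as the Times report describes, while the second cross-checks the two sensors and only warns the pilot when they disagree.

```python
from dataclasses import dataclass

# Hypothetical numbers for illustration only; not real 737 values.
STALL_AOA_DEG = 14.0      # assumed angle of attack treated as "too high"
MAX_DISAGREE_DEG = 5.5    # assumed tolerated difference between the two vanes


@dataclass
class Decision:
    push_nose_down: bool
    warn_pilot: bool


def either_sensor_logic(aoa_left: float, aoa_right: float) -> Decision:
    """Acts if EITHER sensor reads high, so one bad sensor is enough."""
    too_high = aoa_left > STALL_AOA_DEG or aoa_right > STALL_AOA_DEG
    return Decision(push_nose_down=too_high, warn_pilot=False)


def cross_checked_logic(aoa_left: float, aoa_right: float) -> Decision:
    """Acts only if BOTH sensors agree; a large disagreement just warns."""
    if abs(aoa_left - aoa_right) > MAX_DISAGREE_DEG:
        return Decision(push_nose_down=False, warn_pilot=True)
    both_high = aoa_left > STALL_AOA_DEG and aoa_right > STALL_AOA_DEG
    return Decision(push_nose_down=both_high, warn_pilot=False)


if __name__ == "__main__":
    # Left vane stuck at a false high reading during a normal climb.
    print(either_sensor_logic(22.0, 3.0))
    print(cross_checked_logic(22.0, 3.0))
```

With one sensor falsely reading 22 degrees and the other reading 3, the first function commands nose-down and the second only raises a warning.  The price of the cross-check is that a disagreement also blocks the nose-down command in a real stall where one sensor has failed low, which is the kind of tradeoff a designer would have to weigh.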

The altitude records of both 737s involved in the crashes in Ethiopia and Indonesia show that the pilots went on a desperate roller-coaster ride, executing climbs and descents every half-minute or so three or four times before the final descent and crash.  This is consistent with a struggle between the MCAS and the pilots, although other causes could be responsible as well.  Following the Ethiopian crash, China and many other countries grounded all 737 Max 8 and Max 9 planes, and later last week, the U.S. followed suit.

Boeing says it is working on a software upgrade for the planes involved, but it may not be available until April, and so until then, millions of dollars' worth of aviation assets will be out of service.  But that's better than having another 737 crash on takeoff. 

It is too soon to draw definite conclusions about the causes of these crashes.  That has to wait for the analyses of black-box records and other pertinent data.  But investigators have already found that the horizontal stabilizer in the Ethiopian plane was set to push the nose down, which is not something you normally do on takeoff.  And the fact that the MCAS can be triggered by only one AOA sensor is enough reason to take measures such as grounding planes until a remedy can be developed and installed.

Planes are designed by people who work in organizations, and successful designs of safe planes emerge from an exceedingly complex process involving thousands of designers, technicians, supervisors, inspectors, regulators, and other interested parties.  Successful companies manage to evolve with new young staff replacing retired engineers and managers while maintaining the core principles and knowledge that are essential to making planes safe.  And one of those core principles, so easy to understand that even I get it, is to avoid single-point-failure situations whenever possible by installing backup systems and procedures.

If what the Times reported is true, someone dropped the ball with regard to the MCAS's behavior in response to a single erroneous sensor reading.  It could take months or years to figure out how this design error happened.  But the lesson is one that has to be learned if Boeing is to recover from this sequence of disasters, which it probably will.

It's also possible that the accidents involved pilot error in combination with a misbehaving MCAS.  If the pilots didn't know that the MCAS was even installed, or were unfamiliar with what flying the plane with an activated MCAS is like, their actions with regard to it may have contributed to the crashes.  Part of the problem here is that the MCAS rarely activates under typical flight conditions.  Perhaps there is something about the meteorological conditions at the two airports involved that gave rise to a single AOA sensor error that persisted long enough to cause the accidents.

These and other speculations will have to await the full accident reports, which may not be available for months.  But in the absence of more knowledge, grounding the 737 Max 8 and 9 planes until the single-point-failure problem with MCAS can be addressed and demonstrated to be fixed is the wisest course.   



Postscript:  After I posted this blog entry, I received an informative email from a reader who wishes to remain anonymous.  He has given me permission to post it here, as it sheds more light on the concept of single-point failure:

Dear Mr. Stephan,

     I am writing to you in order to add some of my thoughts on your recent post  'Are the 737 Max 8 Crashes Single-Point Failures?' 

     In determining a single point of failure it would seem necessary to choose a boundary for the system or sub-system and also define what the goal of the system is.  In your example the goal is to record the pastor's sermon, and the entire system consists of a video camera and an operator.  So, in this case the video camera is a single point of failure, and adding a second camera provides redundancy.

     In the case of the MCAS the overall goal is to maintain a safe flight path.  The failure of a single AOA sensor may cause the MCAS to make inappropriate changes to the horizontal stabilizer trim.  But there are other subsystems that provide redundancy, most notably the stabilizer trim cut out switches.  The proper use of these switches is a memory item for both pilot and co-pilot.  So, in this broader context, the AOA sensor may not be a single point of failure with regards to the goal of maintaining a safe flight path.

     This design was probably used as the failure to pitch down at the onset of a stall at low level may result in a condition from which recovery is impossible, while an erroneous change in horizontal stabilizer trim can be corrected by timely intervention by the pilots.

And finally, I would like to thank you for your blog.  It provides detailed and nuanced analyses of engineering problems that I have not found elsewhere.

(Name withheld)

Sources:

Boeing 737 technical site:  http://www.b737.org.uk/index.htm

Juan Browne's (a 777 pilot and an airframe and powerplant mechanic) YouTube channel:

FAA Advisory Circulars:


Mentor Pilot (a 737 pilot and line training captain) YouTube Channel:
