
Monday, April 08, 2019

Boeing Confirms Software At Fault In Ethiopian Crash


Last Thursday, Apr. 4, Ethiopian Transport Minister Dagmawit Moges released a preliminary report on the crash of an Ethiopian Airlines Boeing 737 Max 8 outside Addis Ababa last month, which killed all 157 people on board.  Cockpit voice recordings and data from the flight recorder make it very clear that, as Boeing CEO Dennis A. Muilenburg admitted regarding both this crash and that of an Indonesian Lion Air flight last fall, "it's apparent that in both flights the Maneuvering Characteristics Augmentation System, known as MCAS, activated in response to erroneous angle of attack information."  Boeing is currently scrambling to fix both that software problem and another, more minor one uncovered recently, but as of now, no 737 Max 8s are flying in the U. S. or much of anywhere else.  And the FBI is reportedly investigating how Boeing certified the plane.

When we blogged about the Ethiopian crash three weeks ago, there were significant questions as to whether the MCAS alone was at fault or whether pilot errors contributed to the crash.  But according to a summary published in the Washington Post, Minister Moges said that the pilots did everything recommended by the manufacturer to disable the MCAS, which was repeatedly attempting to point the plane's nose downward in response to a single faulty angle-of-attack sensor output.  Their efforts proved futile, and the plane eventually keeled over into a 40-degree dive and hit the ground at more than 500 mph. 

Our sympathy is with those who lost relatives and loved ones in both crashes.  Similar words were spoken by CEO Muilenburg, on whose head lies the ultimate responsibility for fixing these problems.  In doing so, he and his underlings will be dealing with how to smoothly integrate control of life-critical systems when both humans and what amounts to artificial intelligence are involved.

This is not a new problem, but it has transformed so much over the years that it seems new. 

I once toured a museum near Lowell, Massachusetts which preserved a good number of the original pieces of machinery used in one of the many water-powered textile mills that used to dot the landscape in the early 1800s.  Attached to their main water turbine was a large, complicated system of gears, flywheels, springs, levers, and so on which turned out to be the speed regulator for the mill.  As looms were cut in and out of the belt-and-shaft power distribution system, the load would vary, but it was important to keep the speed of the mill's shafts as constant as possible.  The complicated piece of machinery I saw turned out to be a sophisticated control system that kept the wheels turning at the same rate to within a few percent, despite wide variations in load.
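The behavior of such a governor can be sketched in a few lines of code.  The following is my own toy model, not a description of the actual mill machinery: a proportional controller that opens the "gate" wider when the shaft runs slow and closes it when the shaft runs fast, with all numbers invented for illustration.

```python
def governed_speed(loads, target=100.0, gain=0.5, steps=200):
    """Return the settled shaft speed for each load level, under a
    simple proportional speed governor (toy model, invented units)."""
    speed = target
    settled = []
    for load in loads:
        for _ in range(steps):
            power = gain * (target - speed)        # open gate when slow, close when fast
            speed += power - 0.01 * load * speed   # drive power minus load drag
        settled.append(speed)
    return settled

# A 4x variation in load moves the regulated speed only a few percent:
speeds = governed_speed([0.5, 1.0, 2.0, 1.5])
print([round(s, 1) for s in speeds])  # → [99.0, 98.0, 96.2, 97.1]
```

Even this crude model shows the point of the mill's mechanism: without the corrective feedback, a fourfold load change would drag the speed far from its target, but with it the error stays within a few percent.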

I'm sure that from time to time the thing might malfunction, and in that case a human operator would have to intervene, shutting it down if it started to go too fast, for example, or if continued operation endangered someone caught in a belt, say.  So humans have been learning to get along with autonomous machinery for almost two hundred years.

The difference now is that in transportation systems (autonomous cars, airplanes), timing is critical.  And because cars and planes travel into novel situations, not all of which can be anticipated by software engineers, conditions can arise which make it hard or impossible for the humans who are ultimately responsible for the safety of the craft to respond.  That increasingly seems to be what happened to Ethiopian Air Flight 302, as evidenced by the black-box data clearly showing only one angle-of-attack sensor to be transmitting flawed data. 

Such issues have happened numerous times with the limited number of autonomous cars that have been fielded in recent years.  We know of at least two fatalities associated with them, and there have probably been many more near-misses or non-fatal accidents as well. 

But even a severe car wreck can kill at most a few people.  Commercial airliners are in a different category altogether.  They are operated by (mostly) seasoned professionals who should be able to trust that if they follow the procedures recommended by the manufacturer (in this case, Boeing), they will be able to deal with almost any imaginable contingency, even something like a stray plastic bag jamming an angle-of-attack sensor (this is my imagination working, but something had to make it give an erroneous reading).  In the case of the Ethiopian crash, that implied promise was broken.  The pilots did what they were told would disable the MCAS, but it didn't disengage, with disastrous results.

It is unusual for a criminal investigation to be aimed at the civilian U. S. aircraft industry, whose safety record has been achieved under mostly cooperative conditions between the Federal Aviation Administration and the firms who make and fly the planes.  Obviously it is too soon to speculate about what, if anything, will turn up from such an investigation.  In teaching my engineering classes, I sometimes ask if anyone has encountered on-the-job situations whose ethics could be questioned.  And I have heard several stories about how inspection or test records were falsified in order to pass along defective products.  So such things do happen, but one hopes that in a firm with a reputation such as Boeing's, incidents like this are rare. 

The marketplace has ways of punishing firms for bad behavior which are not just, perhaps, but are nonetheless effective.  With the growth of Airbus, Boeing knows it has a formidable rival in commercial aircraft, and any airline with millions of dollars' worth of capital sitting idle on the ground while its 737 Max 8s wait for properly vetted software upgrades is bound to be having second thoughts about going with Boeing the next time it needs some planes.  I would not want to be one of the software engineers or managers dealing with this problem, as the reputation of the company may hinge on the timeliness and effectiveness of the fixes they come up with. 

Boeing has been reasonably transparent about this problem so far, and I hope they continue to be up-front and frank with customers, regulators, investigators, and the public about the progress they make toward fixing these software issues.  People have been learning to get along with smart machines for centuries now, and I am confident that engineers can overcome this issue as well.  But it will take a lot of work and continued vigilance to keep something like it from happening in the future.

Sources:  The Washington Post carried the story "Additional software problem detected in Boeing 737 Max flight control system, officials say," on Apr. 4 at https://www.washingtonpost.com/world/africa/ethiopia-says-pilots-performed-boeings-recommendations-to-stop-doomed-aircraft-from-diving-urges-review-of-737-max-flight-control-system/2019/04/04/3a125942-4fec-11e9-bdb7-44f948cc0605_story.html.  I also consulted a Seattle Times article at https://www.seattletimes.com/business/boeing-aerospace/fbi-joining-criminal-investigation-into-certification-of-boeing-737-max/ and the original report from the Transport Ministry of Ethiopia, which the Washington Post currently has at https://www.washingtonpost.com/context/ethiopia-aircraft-accident-investigation-preliminary-report/?noteId=6375a995-4d9f-4543-bc1e-12666dfe2869&questionId=7ad6fc9d-5427-415d-b719-34ad0b3fecfd&utm_term=.55ff25187605.

Monday, March 18, 2019

Are the 737 Max 8 Crashes Single-Point Failures?


A friend of mine who formerly worked at NASA was talking about his volunteer work at his church, which is to operate the video camera that records the pastor's sermon.  He's going to ask them to buy a second camera, and when I asked him why, he said, "Single-point failure.  That camera goes out, we're up a creek without a paddle." 

The same concept that applies to as humble and non-life-threatening a situation as recording sermons also applies to highly complex systems such as Boeing's 737 Max 8, a new version of the popular 737 aircraft that is in service around the world.  And new evidence from the Mar. 10, 2019 crash of a 737 Max 8 outside Addis Ababa, Ethiopia, in which 157 people died, indicates that a single-point failure may be responsible both for that disaster and for a similar crash of another 737 Max 8 on Oct. 29, 2018 in Indonesia that killed 189.
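The arithmetic behind my friend's request for a second camera is worth making explicit.  Assuming (my numbers, purely for illustration) that each unit fails independently, the chance that every one of N redundant units fails at once falls off as the Nth power of the single-unit failure probability:

```python
def all_fail_probability(p, n_units):
    """Probability that all n independent units fail, each with probability p."""
    return p ** n_units

# One camera with a 2% chance of dying mid-sermon loses the recording
# 2% of the time; a second identical camera cuts that to 0.04%.
print(all_fail_probability(0.02, 1))  # → 0.02
print(all_fail_probability(0.02, 2))  # → 0.0004
```

The same reasoning is why aircraft designers duplicate critical sensors and systems: redundancy buys orders of magnitude in reliability, but only if the system actually uses the redundant inputs.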

The single-point failure possibility involves a new anti-stall system called MCAS, which Boeing installed on the Max 8 version of their 737s when the two engines were moved forward compared to earlier versions.  Because this move made the aircraft more prone to stall, the MCAS system was intended to make the plane handle more like older 737s, reducing the need for extensive pilot retraining.  But evidently, pilots were not thoroughly informed that the new MCAS system was in place and activated until the Ethiopian crash brought attention to the system.

The system works by monitoring information from two sensors called angle-of-attack (AOA) sensors.  These are small fins that stick out from the side of the aircraft rather like wind vanes, and rotate to sense the direction of local airflow with respect to the fuselage.  In a stall, the plane is tilted nose-up excessively with respect to the direction of airflow.  This makes the sensor turn at an angle that the onboard computers use to figure out that it's time to take over the controls from the pilot and push the nose down.

Normally, according to a post on aviationstackexchange.com, the onboard computer takes the output of both AOA sensors into account, and if one indicates a stall and the other doesn't, perhaps just a warning is issued to the pilot.  But according to a New York Times report, the MCAS anti-stall system activates even if only one of the two sensors says the nose is too high.  If anything happens to make one of the sensors give a false reading—a stray updraft from the backwash of a flight that just took off, for example—the MCAS goes into action and pushes the nose down, even if the takeoff is proceeding normally.
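The difference the Times report describes can be sketched in pseudocode-like Python.  Everything here is hypothetical: the names, the threshold, and the logic are mine, invented to illustrate the single-sensor versus cross-checked designs, and bear no relation to Boeing's actual flight software.

```python
STALL_AOA_DEG = 14.0  # made-up stall threshold for illustration

def nose_down_single_sensor(aoa_left, aoa_right):
    """Single-point design: either AOA sensor alone can trigger nose-down trim."""
    return aoa_left > STALL_AOA_DEG or aoa_right > STALL_AOA_DEG

def nose_down_cross_checked(aoa_left, aoa_right):
    """Redundant design: act only when both sensors agree; a disagreement
    produces a crew warning instead of an automatic control input."""
    high = [aoa_left > STALL_AOA_DEG, aoa_right > STALL_AOA_DEG]
    if all(high):
        return True   # both sensors indicate a stall
    if any(high):
        print("AOA DISAGREE: warn crew, no automatic trim")
    return False

# One jammed sensor reads 22 degrees while the other reads a normal 5:
print(nose_down_single_sensor(22.0, 5.0))   # → True  (unwanted nose-down)
print(nose_down_cross_checked(22.0, 5.0))   # → False (warning only)
```

In the single-sensor design, one bad reading is enough to command the nose down during a normal takeoff; in the cross-checked design, the same fault produces only a warning.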

The altitude records of both 737s involved in the crashes in Ethiopia and Indonesia show that the pilots went on a desperate roller-coaster ride, executing climbs and descents every half-minute or so three or four times before the final descent and crash.  This is consistent with a struggle between the MCAS and the pilots, although other causes could be responsible as well.  Following the Ethiopian crash, China and many other countries grounded all 737 Max 8 and Max 9 planes, and later last week, the U. S. followed suit.

Boeing says it is working on a software upgrade for the planes involved, but it may not be available until April, and so until then, millions of dollars' worth of aviation assets will be out of service.  But that's better than having another 737 crash on takeoff. 

It is too soon to draw definite conclusions about the causes of these crashes.  That has to wait for the analyses of black-box records and other pertinent data.  But investigators have already found that the horizontal stabilizer in the Ethiopian plane was set to push the nose down, which is not something you normally do on takeoff.  And the fact that the MCAS can be triggered by only one AOA sensor is enough reason to take measures such as grounding planes until a remedy can be developed and installed.

Planes are designed by people who work in organizations, and successful designs of safe planes emerge from an exceedingly complex process involving thousands of designers, technicians, supervisors, inspectors, regulators, and other interested parties.  Successful companies manage to evolve, with new young staff replacing retired engineers and managers, while maintaining the core principles and knowledge that are essential to making planes safe.  And one of those core principles, so easy to understand that even I get it, is to avoid single-point-failure situations whenever possible by installing backup systems and procedures. 

If what the Times reported is true, someone dropped the ball with regard to the MCAS system's behavior in response to only one erroneous sensor.  It could take months or years to figure out how this design error happened.  But the lesson is one that has to be learned if Boeing is to recover from this sequence of disasters, which it probably will. 

It's also possible that the accidents involved pilot error in combination with a misbehaving MCAS.  If the pilots didn't know that the MCAS was even installed, or were unfamiliar with what flying the plane with an activated MCAS is like, their actions with regard to it may have contributed to the crashes.  Part of the problem here is that the MCAS rarely activates under typical flight conditions.  Perhaps there is something about the meteorological conditions at the two airports involved which gave rise to a single AOA sensor error that persisted long enough to cause the accidents. 

These and other speculations will have to await the full accident reports, which may not be available for months.  But in the absence of more knowledge, grounding the 737 Max 8 and 9 planes until the single-point-failure problem with MCAS can be addressed and demonstrated to be fixed is the wisest course.   



Postscript:  After I posted this blog, I received an informative email from a reader who wishes to remain anonymous.  He has given me permission to post it here, as it sheds more light on the concept of single-point failure:

Dear Mr. Stephan,

     I am writing to you in order to add some of my thoughts on your recent post  'Are the 737 Max 8 Crashes Single-Point Failures?' 

     In determining a single point of failure it would seem necessary to choose a boundary for the system or sub-system and also define what the goal of the system is.  In your example the goal is to record the pastor's sermon, and the entire system consists of a video camera and an operator.  So, in this case the video camera is a single point of failure, and adding a second camera provides redundancy.

     In the case of the MCAS the overall goal is to maintain a safe flight path.  The failure of a single AOA sensor may cause the MCAS to make inappropriate changes to the horizontal stabilizer trim.  But there are other subsystems that provide redundancy, most notably the stabilizer trim cut out switches.  The proper use of these switches is a memory item for both pilot and co-pilot.  So, in this broader context, the AOA sensor may not be a single point of failure with regards to the goal of maintaining a safe flight path.

     This design was probably used as the failure to pitch down at the onset of a stall at low level may result in a condition from which recovery is impossible, while an erroneous change in horizontal stabilizer trim can be corrected by timely intervention by the pilots.

And finally, I would like to thank you for your blog.  It provides detailed and nuanced analyses of engineering problems that I have not found elsewhere.

(Name withheld)

Sources:

Boeing 737 technical site:  http://www.b737.org.uk/index.htm

Juan Browne's (a 777 pilot and an air frame and power plant mechanic) YouTube channel:

FAA Advisory Circulars:


Mentour Pilot (a 737 pilot and line training captain) YouTube Channel: