How is device control software checked for quality?


Many reports on the now-infamous MCAS suggest a software patch is forthcoming in the next several weeks that will address the confounding single point of failure with the two AOA sensors that probably contributed to at least one 737 MAX crash.

As a software developer (of a much different variety than a Boeing engineer), I am very curious:

  1. What do regulators and the public know about the actual code that flies planes (and sometimes crashes them)?

I've been looking for an hour and can't locate a single patent application that even mentions what language it's written in, though I imagine it's a complex network of machine languages and so forth. I tend to doubt the 737 runs on JavaScript...

More importantly:

  2. Is there any procedure for QA'ing an airline manufacturer's code?

  3. Does someone at the FAA get invited to Boeing's private Github account?

I'm mostly being facetious, but you get the idea. I ask because every day it sounds more and more like a freshman error in code cost 300+ lives.

  4. Is there anywhere I can look for descriptions of this mysterious code and its forthcoming patch?
by Chris Wilson 16.03.2019 / 16:27

3 answers

I've worked on some certified software in a different domain (and slightly less strict); the principle would be similar here.

Basically, the certifying authority will want to make sure that proper risk analysis was done on the system, both software and hardware, and that the risks were mitigated to reach a certain level of reliability.

The authority does not have programmers to actually review the code. Instead, they check the requirements, the test coverage, and records showing that all procedures were followed. All code must be traceable to the requirements that caused it to be written and to the tests that cover it, and there must be records that it was reviewed and that all the tests passed. All requirements must have their tests; on top of that there are various stress tests and exploratory tests, and the test plans themselves must be reviewed. There are also static checks and coding guidelines to follow. It is a massive amount of paperwork to make sure no risks were missed.

And then the required reliability for critical systems is so extreme (IIRC the estimated mean time between critical failures must be at least 10⁹ hours) that there is simply no way to demonstrate it by tests alone. Even the hardware won't have that reliability; 10⁹ hours is over a hundred thousand years. So the system has to be designed to fail safe and fail over to a backup, and it is the combined mean time between failures of all the systems during a single flight that gets above the target.

So there is a procedure, but the manufacturer carries it out and the certification authority only verifies that it was followed.

The safety-critical code is indeed usually written in C, or sometimes C++ or Ada. However, safety-critical (and any other hard real-time) code must be written using static memory only. This eliminates a large class of problems that C code is normally known for, and the simplicity of C makes it easier to check statically (and the makers of critical software have powerful, advanced tools for doing so), so it is actually quite a good fit.

The code still tends to be pretty ugly. It is hammered heavily with tests, but when fixing the issues found, the preference is for smaller changes to avoid breaking other parts that already passed the tests, which adds to the ugliness. Nobody thinks it's perfect either, but that's not the point. The hardware is not perfect either, so the backups, fail-safes, and fail-overs are designed to handle both kinds of failures.

And note that the current Boeing issue is actually a problem with the requirements. The trim system may run away if, for example, the switch gets stuck, which is always a risk with a mechanical switch, so there must be a cut-off switch to disconnect it. And, judging from public information only, it seems whoever designed this system argued that the cut-off switch was sufficient to handle failure of the new addition too, so the system was not considered critical and fault detection was not included in its requirements. But in doing so they apparently underestimated the human factors involved; the specific details will be subject to analysis in the investigation, and that is what usually takes so long.


Update: The article about the recent B38M crash (ET-302) links to an FAA document, FAA and Industry Guide to Product Certification, that gives an overview of the whole process. It is not specifically about software, but the general process still applies.

18.03.2019 / 20:51

All software running on certified equipment will have to follow RTCA DO-178C.

This is a fairly complex and expensive process that includes audits specifically with the certification authority (FAA, EASA) throughout the product life cycle, including inspections of requirements, design, implementation, and testing.

Any code developed is going to be closely guarded as a trade secret, so it is highly unlikely you will ever see actual product code unless you work for a company actually developing the software or are a certification authority auditor.

A majority of safety-critical code is going to be developed in either assembly (target specific), C, C++, or Ada.

18.03.2019 / 18:46

Is there any procedure for QA'ing an airline manufacturer's code?

DO-178C requires developing a rigorous and complex quality assurance process. This is handled largely through both internal auditors and QA engineers, who are not just the software designers wearing different hats. It's also handled through designated certification representatives who do audits, review certification submission documents, and approve unresolved bug/problem reports. Many companies also follow standards such as ISO-9000 or CMMI.

This QA process must include audits as explained in FAA Order 8110.49A, called "stage of involvement" audits. These happen throughout the development lifecycle, not just on the final product, and cover, well, everything from whether your planning is adequate to whether requirements match the code to whether your tests meet standards.

Does someone at the FAA get invited to Boeing's private Github account?

What's actually submitted to the FAA is limited, in part because the FAA's main goal during certification is to ensure the airplane was developed with a high-quality process. The FAA relies a lot on the internal auditors and certification liaisons, who have access to the code and requirements. The FAA does see a lot of high-level reports like TSO submittals, a Software Accomplishment Summary, System Safety Analysis, etc. They also see a list of any open bug reports that actually affect the cockpit. Finally, the FAA is involved in a limited way in the flight testing of the plane.

The FAA rarely reviews the actual code. As selectstriker2 pointed out in his/her answer, a lot of aircraft code is considered trade secrets, and some considerations like export controls may apply. Even avionics suppliers and airframe developers are reluctant to share too much data with each other to prevent trade secret theft.

I can't locate a single patent application that even mentions what language it's written in

See What programming languages are used for equipment on board aircraft?. I understand your frustration, as I've found it hard myself to research the technology other companies are using.

Other points

DO-178 has some spots where it relies on the certifying company to be responsible, despite the fact that financial incentives exist to produce "just good enough" software. For example the FAA assumes the company has good design (including human factors), adequate training, strong investment in QA with limited corner-cutting, and no deliberate misrepresentations.

Most importantly, the FAA aims to ensure the software development process is unlikely to produce uncaught errors, not that the software itself is completely bug-free. Many news reports about the recent 737 Max crash sensationalize this as "delegation" or even "self-auditing," but that's how the FAA has always handled avionics: they can't exhaustively test every feature themselves, and largely rely on the manufacturer to decide the complexities of what is and isn't safe. While the FAA has some recognized design standards like TSOs, these standards are very general. In some ways this makes sense: the FAA can't get caught in a cat-and-mouse game of updating their own design requirements for every new variation or feature in avionics.

Among the other certification requirements that apply to your question:

  • A safety analysis with calculations or models showing that issues (say, undetected nose-down below 1000 ft AGL) are not more likely to occur than their severity allows. These models are often based on redundancy levels and malfunction probabilities (e.g. both AOA sensors failing together no more often than once per 10⁹ flight hours) (see ARP4761).
  • Both system and software requirements, with corresponding system and software tests. As an analogy, you're making sure the house is not only built according to blueprints but that the blueprints make sense for a house. This relates to the point @user71659 brought up about validating the design in context of the plane and not just verifying the code has no errors.
  • Thorough testing, including testing safety features against sensor failures, testing lines of code to an MC/DC coverage level, and structural coverage testing to ensure unintended interactions don't occur.
  • Configuration management, including change review and problem tracking. If you're following DO-178C, you can't merge in an unreviewed code change or hide a bug report.
  • Tool qualification to make sure the build environment, automatic tests, etc. are well-designed and stable. See DO-330 for more.

This is a wide topic, and a variety of certification documents apply to your question, including DO-178C, ARP4754A, and ARP4761. Whole books have been written on industry practices, so if you wish to understand this question in detail I'd suggest you pick one up (my go-to is Leanna Rierson's "Developing Safety-Critical Software").

18.03.2019 / 21:07