Once again, bad software has struck. From 7:30 until late afternoon on November 10, Internet access and email were unavailable to most customers of Swisscom, the main mobile services provider in Switzerland. (If you read German, you can find details here; in French, here, requiring free registration; in Italian, here.) Given how wired our lives have become, such outages can have devastating consequences. As an example, customers of some of the largest banks in Switzerland cannot access their accounts online unless they type in an access code, one-time-pad style, sent to their cell phone when they log in.
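To make that dependence concrete, here is a minimal sketch of such an SMS one-time-code login step (the names, helper functions and five-minute expiry are illustrative assumptions, not any bank's actual system); when the mobile network is down, the second factor never arrives and the login cannot complete.

```python
import secrets
import time

CODE_TTL_SECONDS = 300  # assumed five-minute validity; real systems vary

def issue_code(send_sms, phone_number):
    """Generate a short-lived six-digit code and send it over the mobile network."""
    code = f"{secrets.randbelow(10**6):06d}"
    expires_at = time.time() + CODE_TTL_SECONDS
    # If the carrier is unreachable (as during the Swisscom outage),
    # this call fails and the customer never receives the second factor.
    send_sms(phone_number, f"Your login code is {code}")
    return code, expires_at

def verify_code(submitted, issued, expires_at):
    """Accept the login only if the code matches and has not expired."""
    return time.time() < expires_at and secrets.compare_digest(submitted, issued)
```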
That is all the news we will see: Something really bad happened, and it was due to a software bug. A headline for a day or two, then nothing. What we will miss, in this case as with almost all software disasters—most recently, the Great Pre-Christmas Skype Outage of 2010—is the analysis: what went wrong, why it went wrong, and what is being done to ensure it does not happen again. Systematically applying such analysis is the most realistic technique available today for breakthrough improvements in software quality. The IT industry is stubbornly ignoring it. It is our responsibility as software engineering professionals to change that self-defeating and unjustifiable attitude.
I have harped on this theme before [1, 2, 3] and will continue to do so until the attitude changes. Quoting from [1]:
Airplanes today are incomparably safer than 20, 30, 50 years ago: 0.05 deaths per billion kilometers. That’s not by accident.
Rather, it’s by accidents.
What has turned air travel from a game of chance into one of the safest modes of traveling is the relentless study of crashes and other mishaps. In the U.S. the National Transportation Safety Board has investigated more than 110,000 accidents since it began its operations in 1967. Any accident must, by law, be analyzed thoroughly; airplanes themselves carry the famous “black boxes” whose only purpose is to provide evidence in the case of a catastrophe. It is through this systematic and obligatory process of dissecting unsafe flights that the industry has made almost all flights safe.
Now consider software. No week passes without the announcement of some debacle due to “computers”—meaning, in most cases, bad software. The indispensable Risks forum [4] and many pages around the Web collect software errors; several books have been devoted to the topic. A few accidents have been investigated thoroughly; two examples are Nancy Leveson’s milestone study of the Therac-25 patient-killing medical device [5], and Gilles Kahn’s analysis of the Ariane 5 crash, which Jean-Marc Jézéquel and I used as a basis for our 1997 article [6]. Both studies improved our understanding of software engineering. But these are exceptions. Most of what we have elsewhere consists of hearsay, partial information, and plain urban legends—like the endlessly repeated story about the Venus probe that supposedly failed because a period was typed instead of a comma, most likely a canard.
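For readers who have not seen the Ariane analysis: flight 501 failed when a 64-bit floating-point value, the horizontal bias, was converted to a 16-bit integer without any check that it fit, an assumption valid for Ariane 4 trajectories but not for Ariane 5. The sketch below, in Python rather than the Ada of the actual flight software and with only the generic 16-bit bounds, merely illustrates the kind of explicit contract that [6] argues would have documented and caught the assumption.

```python
# Illustrative sketch only; the real inertial reference software was in Ada,
# and the bounds below are just the signed 16-bit range, not flight values.
INT16_MIN, INT16_MAX = -32768, 32767

def to_int16(horizontal_bias: float) -> int:
    """Convert a sensor value to a signed 16-bit integer.

    Precondition (the contract): the value must fit in the 16-bit range.
    Making that assumption explicit is the lesson drawn in [6].
    """
    assert INT16_MIN <= horizontal_bias <= INT16_MAX, \
        "contract violated: horizontal bias does not fit in 16 bits"
    return int(horizontal_bias)
```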
Part of the solution is to use the legal system. For any large-scale software failure in which public money is involved, a law should require the convocation of an expert committee and the publication of a detailed technical analysis. The software engineering community should lobby for the passage of such a law and should not rest until it is enacted.
For private businesses the legal approach may be harder to pursue, as some might view it as undue government interference; but it may still be pushed through, given the obvious public interest in software that works. The scenario would be for the industry to adopt, as a voluntary standard, the principle that every large-scale mishap must automatically lead to an exhaustive and public post-mortem analysis; in Rahm Emanuel's immortal words, "You never want a serious crisis to go to waste."
Until that happens, software will remain brittle. Think of the last time you stepped into a plane, and how different you would have felt if aircraft manufacturers had been allowed, disaster after disaster over the past 70 years, to keep the embarrassing details to themselves and continue business as usual.
References
[1] The one sure way to advance software engineering, 21 August 2009, see here (in my personal blog).
[2] Dwelling on the point, 29 November 2009, see here.
[3] Analyzing a software failure, 24 May 2010, see here.
[4] Peter G. Neumann, moderator: The Risks Digest Forum on Risks to the Public in Computers and Related Systems, available online (going back to 1985!).
[5] Nancy Leveson: Medical Devices: The Therac-25, extract from her book Safeware: System Safety and Computers, Addison-Wesley, 1995, available here.
[6] Jean-Marc Jézéquel and Bertrand Meyer: Design by Contract: The Lessons of Ariane, in Computer (IEEE), vol. 30, no. 1, January 1997, pages 129-130, also available here.
Image source: National Transportation Safety Board, reconstruction of crashed TWA 800 aircraft, public domain (see here).
There is a difference between the investigation of an aircraft crash and that of a software failure. Software, as it is developed today, depends on far more complex components, from the hardware to the compiler. The algorithmic work is just a small piece of the overall picture. Add to that the fact that the software may have been developed using proprietary tools that may never become available to the investigators because of intellectual property issues.
I hate to sound like a naysayer here, but I do not believe that the investigation of software failures can come anywhere close to the process followed by the National Transportation Safety Board, unless it can be proven that lives are at stake, or something equivalent, such as the prospect that the paying public will stop flying because of accidents.
Sanjiv K. Bhatia
sanjiv@acm.org
I think Sanjiv is way wrong... nuclear accident investigations and the shuttle disaster investigations all involved complex technology and proprietary software!
The better analogy is marine accidents investigation.
The key issue is that, as with bank fraud, there is a public perception at stake for the commercial entity, and a risk of brand damage and a PR fiasco if it were all revealed. I remember when someone leaked some Windows source code and I heard people saying how appalled they were at the sloppy coding. All the technical issues can be overcome by a regulator who publishes reports with commercially sensitive information removed or sufficiently anonymised. Recall that most banks are software dependent and refuse to disclose breaches of security that are merely financial. In the UK the data protection regulator forces reporting of loss of private data, so it can easily be done. What stands in the way is the weight of commercial interest in avoiding disclosure; what is required is a public interest argument that might sway policy makers.
I strongly suggest examining open-source software projects, which conduct failure analysis (and much else besides) on open mailing lists. Of course, open-source software projects are not bug-free: any non-trivial program contains bugs. However, it would be quite hard to take Meyer's legal proposal seriously unless it is backed by a thorough understanding of the extensive public failure-analysis experience gained in the many open-source software communities.
Paul E. McKenney, Ph.D.
IBM Distinguished Engineer
Linux-Kernel RCU Maintainer
Many reputable web companies already do this voluntarily; see the Foursquare and GitHub outage post-mortems as examples [1, 2].
Perhaps what we need is better education among software consumers: if a software company isn't talking about its software errors, it's because it is keeping them secret, not because they don't happen.
[1] - http://blog.foursquare.com/2010/10/05/so-that-was-a-bummer/
[2] - https://github.com/blog/744-today-s-outage
There is too much trust in government, and the NTSB gets too much credit. Without it, aircraft manufacturers would have faced a much more demanding regulator: abandonment by the free market. Instead, WE have paid for the disaster investigations THEY would have had to pay for. And the NTSB sometimes blocks reparations to disaster victims by signing off on dubious causes and technospeak. Like TWA 800.
--The market has said that better software is not worth the cost. I wish there were a stronger refactoring bias in my own IT shop, but it's a HUGE cost that they balance against other business priorities.
-- Alan Cassidy