People working above the line of representation continuously build and refresh their models of what lies below the line. That activity is critical to the resilience of Internet-facing systems and the principal source of adaptive capacity.
Imagine that all the people involved in keeping your Web-based enterprise up and running suddenly stopped working. How long would that system continue to function as intended? Almost everyone recognizes that the "care and feeding" of enterprise software systems requires more or less constant attention. Problems that require intervention crop up regularly—several times a week for many enterprises; for others, several times a day.
Publicly, companies usually describe these events as sporadic and minor—systemically equivalent to a cold or flu that is easily treated at home or with a doctor's office visit. Even a cursory look inside, however, shows a situation more like an intensive care unit: continuous monitoring, elaborate struggles to manage related resources, and many interventions by teams of around-the-clock experts working in shifts. Far from being hale and hearty, these are brittle and often quite fragile assemblies that totter along only because they are surrounded by people who understand how they work, how they fail, what can happen, and what to do about it.
The intimate, ongoing relationship between software and hardware components and the people who make, modify, and repair them is at once remarkable and frustrating. The exceptional reach and capacity of Internet-based enterprises result from indissolubly linking humans and machines into a continuously changing, nondeterministic, fully distributed system.
General and specific knowledge of how and why the system's bits are assembled as they are gives these humans the capacity to build, maintain, and extend enterprise technology. Those bits change continuously, creating an absolute requirement to adjust and refresh knowledge, expectations, and plans. Keeping pace with this change is a daunting task, but it is possible—just—for several reasons:
The barriers to entry into this network are low. There is not yet the formal training or certification of authority found in other domains (for example, medicine). This has promoted rapid growth of the community while also creating uncertainty that manifests in hiring practices (for example, code-writing exercises).
This community of practice appears to have a distinct ethos that puts great emphasis on keeping the system working and defending it against failures, damage, or disruption. The community values both technical expertise and the capacity to function under stress; membership in the community depends on having successfully weathered difficult and demanding situations. Similarly, the collective nature of work during threatening events encourages both cooperation and support. As Lave and Wenger observed for other communities of practice, mastery here is gained via "legitimate peripheral participation."3
All these features are simultaneously products of the environment and enablers of it. They have emerged in large part because the technical artifacts are evolving quickly, but more so because the artifacts cannot be observed or manipulated directly. Computing is detectable only via representations synthesized to show its passing. Similarly, it can be manipulated only via representations.
The accompanying figure shows an Internet-facing system. The horizontal line comprises all the representations available to people working above that line, including displays, screens, and other output devices, as well as keyboards, mice, and other input devices. Below this line lie the technical artifacts: code libraries, IDEs, test suites, compilers, CI/CD (continuous integration/continuous delivery) pipeline components, and the computational capacity itself, including technology stacks and services. Above the line of representation are the people, organizations, and processes that shape, direct, and restore the technical artifacts that lie below it.
Figure. An Internet-facing system.
People who work above the line routinely describe what is below the line using concrete, realistic language. Yet, remarkably, nothing below the line can be seen or acted upon directly. The displays, keyboards, and mice that constitute the line of representation are the only tangible evidence that anything at all lies below the line.
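To make this concrete, consider a minimal sketch, in Python, of the only kind of access anyone above the line ever has. The endpoint URL, service name, and response fields here are invented for illustration; real monitoring stacks differ in detail but not in kind. The operator receives a representation the system was built to emit about itself, never the computing itself.

    # A hypothetical health check: everything "known" about the service
    # arrives above the line as a synthesized representation.
    import json
    from urllib.request import urlopen

    def render_status(url: str) -> str:
        # The operator never observes the service directly, only this
        # response: a representation the service was programmed to emit.
        with urlopen(url, timeout=5) as resp:
            payload = json.load(resp)
        # All inferences about below-the-line state start from these fields.
        state = payload.get("state", "unknown")
        latency = payload.get("p99_latency_ms", "?")
        return f"checkout-service: {state} (p99 {latency} ms)"

    # The string printed here is itself another representation, one step
    # further removed from the computation it describes.
    print(render_status("https://status.example.internal/checkout/health"))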
All understandings of what lies below the line are constructed in the sense proposed by Bruno Latour and Steve Woolgar.2 What we "know"—what we can know—about what lies below the line depends on inferences made from representations that appear on the screens and displays. These inferences draw on our mental models—those that have been developed and refined over years, then modified, updated, refined, and focused by recent events. Our understandings of how things work, what will happen, what can happen, what avenues are open, and where hazards lie are contained in these models.
It will be immediately apparent that no individual mental model can ever be comprehensive. The scope and rate of change ensure that any complete model will be stale and that any fresh model will be incomplete. David Woods said this clearly in what is known as Woods' theorem:4
As the complexity of a system increases, the accuracy of any single agent's own model of that system decreases rapidly.
The levels of complexity below and above the line are similar: as the complexity below the line has increased, so too has the complexity above it.
For some events, troubleshooting and repair are highly localized both below and above the line. When there is a one-to-one mapping from a below-the-line component to an above-the-line individual or team, the work of coordination can be small. For other events, troubleshooting and repair can be arduous because the manifestations of the anomaly are far from its sources—so far, in fact, that it is unclear whose knowledge could be useful.
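A small sketch, using invented component and team names, shows why the one-to-one mapping keeps coordination cheap and where it breaks down:

    # Hypothetical ownership map: below-the-line components to the
    # above-the-line teams that know them best.
    OWNERS = {
        "payments-db": "storage-team",
        "checkout-api": "checkout-team",
        "cdn-config": "edge-team",
    }

    def page_for(component: str) -> str:
        # Cheap coordination: when the anomaly names its source, a lookup
        # finds the right experts. When checkout errors are actually caused
        # by a cdn-config change, this pages the wrong team, and the
        # expensive above-the-line coordination work begins.
        return OWNERS.get(component, "incident-commander")

    print(page_for("payments-db"))            # storage-team
    print(page_for("user-visible-slowness"))  # incident-commander

The fallback in the last line is the telling case: when no mapping exists, the question of whose knowledge could be useful becomes a human judgment.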
These events are often quite different from those in domains where roles and functions are relatively well defined and task assignment is a primary concern. Coordinating collaborative problem solving in critical digital services is the subject of intense investigation and the target of many methods and tools, yet it remains a knotty problem.
A similar argument developed around human-computer interaction in the 1970s. Efforts to treat the computer and the human operator as separate and independent entities broke down and were replaced by a description of human and computer as a "system." Large-scale distributed computing and the similarly distributed approaches to programming and operations are replicating this experience on a larger scale.
Incidents are a "set of activities, bounded in time, that are related to an undesirable system behavior."1 The decision to describe some set of activities as an incident is a judgment made by people above the line. Thus, an incident begins when someone says that it has begun and ends when someone says it has ended. Like the understanding of what lies below the line, incidents are constructed.
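One way to see this constructed character is that any record of an incident bottoms out in human declarations. In the following sketch, whose field names are illustrative rather than drawn from any particular incident tool, neither boundary is machine-detected; both are judgments entered by people:

    # An incident record whose start and end are declarations, not
    # measurements. Field names are hypothetical.
    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import Optional

    @dataclass
    class Incident:
        description: str
        declared_by: str                   # a person says it has begun...
        declared_at: datetime = field(default_factory=datetime.now)
        resolved_by: Optional[str] = None  # ...and a person says it has ended
        resolved_at: Optional[datetime] = None

        def close(self, who: str) -> None:
            # Nothing below the line marks this moment; it is a judgment.
            self.resolved_by = who
            self.resolved_at = datetime.now()

    incident = Incident("elevated 5xx on checkout", declared_by="on-call engineer")
    incident.close("on-call engineer")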
Knowledge and understanding of below-the-line structure and function are continuously in flux. Near-constant effort is required to calibrate and refresh the understanding of the workings, dependencies, limitations, and capabilities of what is present there. In this dynamic situation no individual or group can ever know the system state. Instead, individuals and groups must be content with partial, fragmented mental models that require more or less constant updating and adjustment if they are to be useful.
Related articles on queue.acm.org
Continuous Delivery Sounds Great, but Will It Work Here?
Jez Humble
https://queue.acm.org/detail.cfm?id=3190610
A Decade of OS Access-control Extensibility
Robert N.M. Watson
https://queue.acm.org/detail.cfm?id=2430732
The Network's NEW Role
Taf Anthias and Krishna Sankar
https://queue.acm.org/detail.cfm?id=1142069
1. Allspaw, J., Cook, R.I. SRE cognitive work. Seeking SRE: Conversations About Running Production Systems at Scale. D. Blank-Edelman, ed. O'Reilly Media, 2018, 441–465.
2. Latour, B., Woolgar, S. Laboratory Life: The Construction of Scientific Facts. Sage Publications, Beverly Hills, CA, 1979.
3. Lave, J., Wenger, E. Situated Learning: Legitimate Peripheral Participation. Cambridge University Press, Cambridge, U.K., 1991.
4. Woods, D.D. Stella: Report from the SNAFUcatchers Workshop on Coping with Complexity. The Ohio State University, 2017; https://snafucatchers.github.io/.
Copyright held by author/owner. Publication rights licensed to ACM.
Request permission to publish from permissions@acm.org