Communications of the ACM

Research highlights

Technical Perspective: Computation Where the (inter)Action Is

SoundWatch is a prototype system that detects audio events and displays descriptions of them to deaf and hard-of-hearing people via the screen of their smartwatch. Beyond the system itself, SoundWatch contributes a case study of the opportunities and challenges we might expect as computation continues to move closer to where the interaction happens.

Access technology has long been a window into the future, and so we can learn a lot from prototypes like SoundWatch. As one example, speech recognition is now mainstream, but the people who have relied on it the longest are those who found it difficult to type otherwise. Mainstream user interfaces focus on a small set of modalities, whereas accessibility necessarily explores interactions beyond common ability assumptions.

Access technologies sense rather than try to understand. As an example, think about the difference between SoundWatch telling the user to "go open the door" versus alerting them that it might have heard a "doorbell." Both may lead the user to check the door, but the latter message better protects user agency and better enables a human user to make up for system limitations. If I don't have a doorbell or am not at home, I am better positioned to reason about what other sound events might be plausible if SoundWatch displays that it heard a doorbell.

Early access technologies were bulky. They also didn't work well. They were slow, inaccurate, and severely limited in what they could do. They were sometimes adopted nevertheless when they provided value over alternative approaches. Early sound recognition systems detected only a handful of sounds, plugged into the wall, and were expensive special purpose devices. These days, basic sound recognition comes standard on smartphones.

Computation, notably machine learning, has generally been moving closer to where the interaction happens. Smartphones made interaction mobile, and prototypes of wearable computers have existed for decades; commodity smartwatches might seem like a small step in comparison, but they significantly change interaction possibilities. A smartwatch screen is always glance-able, whereas a phone is often hidden away in a pocket or a bag. Phones must be intentionally carried around and are fragile, whereas a smartwatch is attached to people, making it difficult to forget or drop. Devices too bulky or too strange or too ugly get left behind (a well-known design consideration in accessibility), whereas smartwatches put computation in a 200-year-old, widely accepted form factor.

Smartwatches do not yet have the computational power necessary to be a person's only device, and ML models with good performance are big and computationally expensive. Model compression and research into efficient ML are tackling these issues. Regardless, at any given time, the best performing model for most interesting real-world problems will require more computation than the lowest-power device people regularly use. SoundWatch recognizes 20 sounds, but what if we wanted to recognize 1,000 or 10,000 sounds, transcribe speech, or be more robust to noise? Those capabilities will be available first on more powerful devices.

Computer scientists are comfortable designing architectures that trade off different computational capabilities and latencies. SoundWatch illustrates that this must now necessarily include multiple computational devices on our bodies. SoundWatch explored a network with a smartwatch, a smartphone, and a remote server. What will it take to be prepared for a near future in which these trade-offs must be weighed against the computation available on many wearable devices, for example, headphones, rings, contact lenses, and shoes worn by the user? The closer the computation is to the user interaction, the less powerful it likely is, which creates a new human-centered trade-off whose objectives are not as easily or universally defined.

Human-computer interaction (HCI) research has a vital role to play in designing these architectures because deciding where computation should happen is not only a technical question but also a human one. The SoundWatch study provides an example. By interacting with SoundWatch, potential users were able to provide more ecologically valid feedback about which sounds needed to be detected quickly (for example, those related to safety), and which could reasonably take longer (for example, environmental sounds). A challenge not explored by SoundWatch is how to design usable systems whose performance and capabilities change when the underlying architecture changes (for example, when I have my phone versus when I don't). More HCI research is needed!
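One way to picture the trade-off described above is a small routing rule that picks where a sound should be classified. The sketch below is illustrative only: the tier names mirror the watch, phone, and server network mentioned earlier, but the `Tier` class, the `choose_tier` function, the latency figures, the quality ranks, and the per-category budgets are all hypothetical placeholders, not values or code from SoundWatch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    latency_ms: int  # placeholder estimate, not a measurement
    quality: int     # placeholder rank: higher means a larger, more capable model


# Hypothetical tiers: closer to the user is faster but less capable.
TIERS = [
    Tier("watch", latency_ms=100, quality=1),
    Tier("phone", latency_ms=300, quality=2),
    Tier("server", latency_ms=1000, quality=3),
]

# Hypothetical latency budgets: safety sounds must be reported quickly,
# environmental sounds can reasonably take longer.
BUDGET_MS = {"safety": 200, "environmental": 2000}


def choose_tier(category, available):
    """Return the most capable reachable tier that fits the latency budget.

    If no reachable tier fits the budget, fall back to the fastest one.
    """
    reachable = [t for t in TIERS if t.name in available]
    if not reachable:
        raise ValueError("no device available")
    budget = BUDGET_MS[category]
    fitting = [t for t in reachable if t.latency_ms <= budget]
    if fitting:
        return max(fitting, key=lambda t: t.quality).name
    return min(reachable, key=lambda t: t.latency_ms).name


if __name__ == "__main__":
    everything = {"watch", "phone", "server"}
    print(choose_tier("safety", everything))
    print(choose_tier("environmental", everything))
    print(choose_tier("environmental", {"watch"}))  # phone left at home
```

Even in this toy form, the human questions remain outside the code: which sounds count as safety-critical, and how the interface should communicate that capabilities shrink when the phone is not nearby, are decisions that depend on feedback from users rather than on the routing rule itself.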

SoundWatch invites us to peer into near-term futures that will shape not only accessibility but, more broadly, how we interact with technology. People may have been so focused on waiting for a sci-fi vision of augmented humans that we have missed that computation is now pervasively with us. Our phones, watches, left and right headphones, and many other devices are now capable of computation. What interactions could usefully be localized there, what performance will people expect, and what innovations across computer science will be needed to support them? SoundWatch is directly about improving accessibility, but prototypes like this ultimately push us to drive forward every area of computing.

Jeffrey P. Bigham is an associate professor in the Human-Computer Interaction Institute at Carnegie Mellon University, Pittsburgh, PA, USA.

To view the accompanying paper, visit

Copyright held by author.
Request permission to (re)publish from the owner/author

The Digital Library is published by the Association for Computing Machinery. Copyright © 2022 ACM, Inc.
