Margo Seltzer, the Canada 150 Research Chair in Computer Systems at the University of British Columbia and 2023–2024 ACM Athena Lecturer, is the kind of researcher who stands out not just for her accomplishments, but for her tirelessness. After building a database software library that underpinned many first-generation Internet services, she has worked on topics ranging from file systems and storage to capturing and accessing data provenance. Here, she speaks with Leah Hoffmann about finding impactful research projects—and keeping up with everything that's going on in the field.
The story of Berkeley DB, the database software library that you built with Keith Bostic and Mike Olson, has been told before at greater length, but let me see if I can summarize. Your work on packages such as hash and B-tree was released with Berkeley Unix as the DB 1.85 library. Then, as a side project, you and Mike created a transactional storage system—which Netscape later wanted to integrate into the LDAP directory server it was building on a Berkeley DB core. That prompted you and Keith to launch Sleepycat Software and create a production-quality transactional library. Before Netscape approached you, did you guys ever think, "maybe we should commercialize, maybe there's something there"?
I had always wanted to build a database with the tool-based Unix-style philosophy. And at one of my first jobs, I was involved in a development effort that started me thinking along these lines.
Keith had also been thinking about something similar; he was reimplementing the vi editor, and he felt the right way to do it was to build it on top of a record management package. Finally, Mike Olson had built B-trees at all three of the jobs that he had before going back to grad school, so his motivation was, "I'm going to build one more B-tree, release it, and I will never have to do this again."
So, in some sense, all three of us had ulterior motives to do this.
The Netscape deal essentially provided seed money for Sleepycat—you did not raise additional funds.
People seem to think the only way to build a company is to raise $1 billion and go big, and I do not understand that. An organically grown startup is a much saner lifestyle—and if you genuinely care about the product, as opposed to getting famous and/or rich, that is absolutely the way to do it. We controlled our destiny, start to stop. And we had a really good time. To be fair, we worked our butts off, but many of our employees still describe it as the best job they ever had.
Sleepycat also pioneered the open-source dual license.
Today, you look at open source and everybody says, "Oh, we know what that is." But that was not always the case. We had to educate the market to a large extent. And it's not clear that you could build the same kind of dual-license company today—not just because we got lucky, but because software architecture is fundamentally different. We were able to leverage the fact that using Berkeley DB required linking our code with a customer's code, and our license relied on that. This architecture appears less frequently today.
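For readers unfamiliar with that architecture, here is a minimal sketch of what embedding Berkeley DB looks like from the application side, using the later Sleepycat-era C API rather than the original DB 1.85 interface (the file name and key/value pair are illustrative, and error handling is pared down):

    /* Minimal sketch: Berkeley DB is compiled and linked directly into
     * the application (e.g., cc app.c -ldb), which is the architectural
     * property the dual license relied on. Illustrative only; production
     * code would check every return value. */
    #include <db.h>
    #include <stdio.h>
    #include <string.h>

    int main(void) {
        DB *dbp;
        DBT key, data;

        /* Create a handle and open (or create) a B-tree database. */
        if (db_create(&dbp, NULL, 0) != 0)
            return 1;
        if (dbp->open(dbp, NULL, "example.db", NULL,
                      DB_BTREE, DB_CREATE, 0664) != 0)
            return 1;

        /* Store one key/value pair. DBTs must be zeroed before use. */
        memset(&key, 0, sizeof(key));
        memset(&data, 0, sizeof(data));
        key.data = "fruit";
        key.size = sizeof("fruit");
        data.data = "apple";
        data.size = sizeof("apple");
        dbp->put(dbp, NULL, &key, &data, 0);

        /* Read it back. */
        memset(&data, 0, sizeof(data));
        dbp->get(dbp, NULL, &key, &data, 0);
        printf("%s -> %s\n", (char *)key.data, (char *)data.data);

        dbp->close(dbp, 0);
        return 0;
    }

Because the library ships as code linked into the customer's own binary, an application that used it fell under the open-source terms unless the customer bought a commercial license, which is precisely the mechanism the dual license leveraged.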
Let's talk about your work on data provenance, which suggests various practical uses for collecting information about where data comes from and how it is manipulated and stored. However, arriving at useful applications has been something of a rocky journey.
The history of data provenance is frustrating. The field dates back to a paper by Peter Buneman, who pointed out that because relational databases have a rigorous, formal mathematical underpinning, you can state precisely where data comes from and why it is produced as the result of a query. From there, the workflow community said, "If we use this to track scientific workflows, it will make things reproducible." And then there was my group, the fringe lunatics who said, "Actually, you can't make things reproducible unless you have provenance at the system level to understand if people installed new libraries or upgraded the compiler or changed their system in some other way."
The community then issued a series of provenance challenges to better understand the different representations used for provenance and explore their expressive power.
Yes, to their credit. However, while the scenarios were realistic, there wasn't a user community that cared. As a result, in my opinion, the standards that emerged did not help people who want to use provenance; such users often end up having to implement their own language on top of the standard.
The other thing that's held provenance back is that the places where it's most useful, at least so far, are the kinds of places that I liken to insurance. Nobody goes out and buys insurance because they want it; they buy insurance because they have to. People don't want to "pay for" provenance (that is, add any overhead to their work); they only discover they should have when something bad happens—their system gets infiltrated, they end up with malware on it.
"People don't want to 'pay for' provenance; they only discover they should have when something bad happens—their system gets infiltrated, they end up with malware on it."
Without compelling applications, it's difficult to say people should collect this type of data.
One of my "aha" moments came during a chat with an ecologist who had been a computer scientist. I described what we were building, and she was very receptive. I asked if she would use those types of tools. She said, "Well, what would I have to do to use them?"
I said, "Basically, once the package is installed, all you really have to do is type one command before you run your analysis." And that was a deal breaker.
You've spoken a lot about how the field has been partitioned over the years, with different micro-communities that don't talk to one another. What sort of reception do you get?
I think people largely agree, and at the same time, they don't know how to push back, because if you want to get tenure, the easy thing to do is pick your little community and do really well in that community.
I detected a familiar note of desperation in some of the follow-up questions you've received—like, "How can you expect everyone to keep up with adjacent fields when we struggle to stay on top of our own?" To which your response was, "You don't need to be an expert in everything, just to have a vague sense of what people are working on so you can draw on ideas when the opportunity arises."
I think it's really hard for junior people to walk into a project and realize they can't control everything. By the same token, how do I know my students are doing the right things? It's hard.
You've made a similar point about the separation between academic and product communities.
Back when both communities were smaller, there was less of a sharp divide between the people who built products and the people who did research. And I understand why that's shifted, but I do worry that our research is less informed by real problems and more by problems that academics find interesting.
And yet industry has very real problems, and most companies aren't large enough to have their own internal research team to solve them. What can we do to bridge that gap? Longing for the good old days when industry and academia met at conferences may not be the right solution. But I do think there's a problem, and we need to find a 2023 solution instead of the 1980 solution.
It also seems the research community's definition of "practical" may be a little different from the product community's. Is the goal to build something that's forward-looking or something that can be commercialized within six months?
I think that's really key. If your research could be commercialized in six months, that assumes you're building a new thing. Academics like that kind of research. But the research that has real impact on industry, that solves a problem faced by every single company, might not be a brand-new thing that's going to make you rich. And that's the kind of research that is harder to do now. In some sense, this brings us full circle. Every single data scientist has really good problems to solve, and if we can find the right way to trick them into capturing data provenance by giving them tools that make their job easier, then maybe we can do work that has real impact.