
Communications of the ACM

Practice

Taking Flight with Copilot




In pair programming, two developers write code together. One takes the role of driver, writing the code needed for the task at hand, while the other assumes the role of navigator, directing the flow of the work and reviewing the code in real time. This division of labor allows the driver to focus on detailed coding—syntax and structure. Proponents of pair programming say it improves code quality and readability and can speed up the reviewing process.5 To date, effective pair programming has required the complex coordination of getting two programmers to work together synchronously, which has made it challenging for teams to adopt this approach at scale. The emergence of new AI-powered tools to support programmers has shifted what it means to pair program.

GitHub Copilot is an AI-powered developer tool leading this shift. GitHub released Copilot as a free technical preview on June 29, 2021, letting hundreds of thousands of developers try coding with an AI pair programmer. Copilot became generally available as a paid product on June 21, 2022. This article focuses on the earliest releases of Copilot—the free technical preview—because these allowed us to capture some of the first experiences with an AI pair programmer. While some changes have been made to the version of Copilot released in 2022, the user interface and experience are largely the same.

With an AI assistant acting as the driver, developers take on the role of navigator: They direct the detailed development work and review the code as it is being written. In addition, the AI assistant can write code (directed by the developer navigator) much faster than a human peer, potentially speeding up the process.

Copilot received public attention quickly, generating conversation in forums, press, and social media. These impressions ranged from excitement about potentially boosting developers' productivity to apprehension about AI replacing programmers in their jobs.

But what were developers' actual experiences with Copilot, beyond the hype found in top tweets, Hacker News, or Reddit posts? During the early days of the technical preview, we investigated the initial experiences of Copilot users. This provided a unique opportunity to observe how developers used the tool, as well as what challenges they encountered. While most of the observations were expected, there were some surprises as well. This article presents highlights from these initial empirical investigations. We should note that the models and technology used to develop Copilot are changing rapidly; this analysis was current as of January 2022. Although some features of the tool have evolved, the general themes discussed here still hold true at the date of publication. These highlights include:

  • The diverse ways that developers are using Copilot, including things that worked great and things that very much didn't.
  • Challenges developers encountered when using the technical preview of Copilot, yielding insights into how to improve AI pair-programming experiences.
  • How AI pair programming shifts the coding experience and reveals the importance of different skills—for example, the emerging importance for developers of knowing how to review code as much as how to write it.
  • A discussion of opportunities that AI presents to the software development process and its potential impact.

We conducted three studies to understand how developers use Copilot:

  1. An analysis of forum discussions from early Copilot users.
  2. A case study of Python developers using Copilot for the first time.
  3. A large-scale survey of Copilot users to understand its impact on productivity.

We will discuss each of these studies (summarized in Table 1).

Table 1. Summary of studies included in this article.


What Is Copilot and How Does It Work?

At its core, Copilot consists of a large language model and an integrated development environment (IDE) extension that queries the model for code suggestions and displays them to the user. You may already have interacted with such models when writing documents or text messages. These models have been trained on billions of lines of text and have developed the ability to predict, with high accuracy, what you are going to type next.

The important difference here is that Copilot uses Codex, a language model trained on source code rather than natural-language text (for example, email, text messages, websites, or documents). This source code came from a large portion of the public code on GitHub, the largest repository of source code in existence.

According to OpenAI, the team that built Codex, the model has been trained on more than 159GB of Python code alone, in addition to code from many other languages. As such, it's quite common for code that a developer is writing to be similar to some combination of pieces of code that Copilot has seen before (during model training). Copilot recognizes these similarities and offers suggestions based on the similar pieces of code on which it was trained. Copilot, however, doesn't make suggestions to developers by making copies of code that it has seen previously. Rather, Copilot generates new suggestions (some of which may not actually exist in any code base) by synthesizing what it has observed in billions of lines of code during model training.

While Copilot leverages this very large model, a high-quality code-suggestion engine alone is not enough to make developers more productive. Copilot has been incorporated into multiple IDEs in a way that makes code suggestions timely and seamless. As you write code, requests are continuously sent to Copilot's AI model, which is optimized to analyze the code, identify useful suggestions, and return them in fractions of a second so they can be offered to developers in their IDEs when needed, as shown in the accompanying figure.

Figure. Example screenshot of an IDE with Copilot code completion.

The experience developers encounter is similar to the autocompletion that has been in modern IDEs for years. The difference is that Copilot's suggestions may be longer, sometimes spanning multiple lines of code, and ideally more contextually relevant and helpful. Copilot can currently be used in VS Code, Visual Studio, Neovim, and JetBrains IDEs, including PyCharm and IntelliJ IDEA.
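To make this request loop concrete, the following is a minimal sketch of what such an editor extension might do. The endpoint URL, payload shape, and timing budget are illustrative assumptions of ours, not Copilot's actual protocol.

```python
# Hypothetical sketch of an IDE extension's suggestion loop (not Copilot's
# real protocol): send the code before the cursor, get back a completion.
import requests

COMPLETION_ENDPOINT = "https://example.com/v1/completions"  # placeholder URL

def request_suggestion(code_before_cursor: str, language: str) -> str:
    """Send editor context to a code model and return one suggestion."""
    response = requests.post(
        COMPLETION_ENDPOINT,
        json={
            "prompt": code_before_cursor,  # context: the active file up to the cursor
            "language": language,
            "max_tokens": 64,              # keep suggestions short and fast
        },
        timeout=0.5,  # suggestions must arrive in fractions of a second to feel seamless
    )
    response.raise_for_status()
    return response.json()["completion"]

# The IDE renders the returned text as inline "ghost text" that the developer
# can accept (for example, with Tab) or dismiss by continuing to type.
```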

During these initial investigations of Copilot, it became evident that developers were often unclear about how it worked. Among the most common misunderstandings was when Copilot learns from code: While Copilot does learn from code, that learning happens during a general training phase, in which OpenAI trains a general-purpose programming model (Codex) that is then fine-tuned using a selected set of public codebases on GitHub. (For more about Codex, refer to https://openai.com/blog/openai-codex/.)

In contrast, when a developer uses Copilot to generate suggestions during a coding session, the code sent to Copilot to generate the suggestion is not used to make suggestions to other developers. Simply put, this model will not "leak" code to anyone else. In fact, such Web requests are ephemeral, and the code is not logged, captured, or recorded (unless users agree to collection).


Why Are People Using Copilot?

To get a better understanding of how people were using Copilot, we collected and analyzed users' self-reported experiences to uncover the challenges they faced and the opportunities this tool presented them, as well as to identify how the technology could evolve. The Copilot Discussion forum, available to all technical preview users, was a valuable resource. In this forum, Copilot users provided code snippets, links to blogs, and even live coding videos. An analysis of all 279 posts available, which were devoted to general feedback and personal showcases, uncovered strengths and challenges of Copilot, as well as uses that extend beyond traditional coding tasks.

Among the strengths of Copilot highlighted by users: It helped them generate simple methods and classes and even infer which methods to write based on comments. This flexibility is largely a result of the context Copilot uses to make its recommendations: information about the currently active file and code region, which is used to prompt the Codex model and adapt the recommendations to the developer. Users reported that Copilot can adapt to different coding styles (that is, naming and formatting conventions). In one instance, Copilot was able to make recommendations in someone's own custom programming language. The ability to work with multiple languages was also cited as a strength.
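As a constructed illustration (not a recorded Copilot session) of inferring a method from a comment, consider a developer who types only the comment line below; users described Copilot proposing a complete function body along these lines:

```python
# Constructed example: the developer writes only the comment; the body below
# shows the kind of multi-line completion users described Copilot suggesting.

# given a list of orders, return the total price including 8% sales tax
def total_with_tax(orders):
    subtotal = sum(order["price"] for order in orders)
    return round(subtotal * 1.08, 2)

print(total_with_tax([{"price": 10.00}, {"price": 5.50}]))  # 16.74
```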

Challenges that Copilot users raised included the risk of revealing secrets such as API keys and passwords or suggesting inappropriate text. This feedback was addressed during the technical preview by adding a filter to Copilot that removed problematic suggestions. Users reported that Copilot sometimes does not write "defensive code"—for example, when Copilot writes a complete method that takes a parameter, it may not check if pointers are null or if array indices are negative. In some cases, users found Copilot suggestions to be distracting and, as such, requested keyboard shortcuts for turning Copilot on/off or muting it for a period. Users also asked that Copilot take more than the current file into account as context.
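The "defensive code" gap is easy to illustrate. The following is a constructed example (not an actual Copilot suggestion) of a method that skips input validation, alongside the checks a reviewing developer would need to add:

```python
# Constructed illustration of the "defensive code" gap users reported.
def get_item(items, index):
    return items[index]  # no guards: None input raises, a negative index silently wraps

# The reviewer's job is to add the checks the suggestion omitted:
def get_item_defensive(items, index):
    if items is None:
        raise ValueError("items must not be None")
    if not 0 <= index < len(items):
        raise IndexError(f"index {index} out of range for {len(items)} items")
    return items[index]
```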

In addition to traditional AI pair programming (that is, the code completion people think of when using AI in a programming environment), Copilot was used for tasks that extended well beyond this scenario. A few interesting use cases emerged where participants used Copilot to help them learn a new programming language, set up a quick server, solve math problems, and write unit tests and regular expressions.
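For the regular-expression use case, a hedged illustration of the pattern users described (writing a comment and accepting a suggested expression) might look like the following; the regex and test below are ours, not captured suggestions:

```python
import re

# match a simple email address like "user@example.com"
EMAIL_RE = re.compile(r"^[\w.+-]+@[\w-]+\.[\w.]+$")

def is_email(text: str) -> bool:
    return EMAIL_RE.match(text) is not None

# a unit-test-style check, the other use case users mentioned
assert is_email("user@example.com")
assert not is_email("not-an-email")
```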

These use cases are related to coding but are much more powerful than simple line completion. In fact, one developer passed two coding skills tests without prior knowledge of complex code in the tested languages by using knowledge of other languages and leveraging Copilot for support. Several developers used Copilot to translate text messages from one language to another—for example, from English to French. Another developer connected Copilot with a speech-recognition tool to improve accessibility by building a code-as-you-speak feature. The range of user experiences we observed during the technical preview of Copilot is summarized with examples in Table 2.

Table 2. Early experiences with Copilot.

Some early users had questions about how Copilot creates its code suggestions. They expressed uncertainty about licensing of the code generated by Copilot. Some believed aspects of Copilot, such as code suggestions, might be required to be released under open licenses because the code used to train Codex (the AI model powering Copilot's suggestions) was potentially subject to open source licenses. Others chimed in with different perspectives, expressing varying opinions on whether Copilot's suggestions should be released under open licenses or whether, if used by developers, they could pose license-compliance issues. The rationale for these opinions varied: Some believed that the open source licenses applicable to code used to train Copilot somehow applied to its suggestions; others pointed to copyright law; and still others posited a "moral basis" for giving back to open source developers.

There were also discussions about how copyright applied to Copilot's code suggestions. Some users questioned whether code suggested by Copilot would be protected by copyright, and if so, who might assert the copyright—GitHub, a developer using Copilot on a project, or even owners of code used to train Copilot (in cases where Copilot's suggestions matched the code used to train Copilot). The discussions revealed that many developers may be unfamiliar with global copyright laws. Similarly, the discussions revealed that developers have differing perspectives about whether code generated by use of an AI model should be capable of copyright protection.

Readers should note that this section summarizes the comments in Copilot Discussion forum posts and that users did not always use precise terms. Also note that this summary presents a synopsis of the posts as they were observed and does not represent any personal opinions of the authors or the official stance of the authors' employers. Finally, note that while we have endeavored to summarize the comments to the best of our ability, we are not legal professionals, and this summary is not intended to serve as a legal review of these topics.


How Are First-Time Users Engaging with Copilot?

Findings from the forum analysis inspired several questions about Copilot use in practice. Thus, to better understand how developers engage with this tool in situ, we conducted an in-depth case study with five professional Python developers, strategically selected as external industry developers who had not interacted with Copilot before. The study was conducted according to the following protocol:

  1. Participants were walked through a short demonstration of Copilot.
  2. After a brief description of the requirements, participants were asked to implement a command-line tic-tac-toe game.
  3. Participants were asked to implement a "Send email" feature.
  4. A post-study reflective interview was conducted to assess the participants' overall perspective on Copilot.

The order of these tasks was chosen strategically: first, to provide guidance to the participants on how to use Copilot; then to have them use the tool independently to build their own foundation and mental models of how things work; and finally, to create a scenario where they would likely need to look up an API they do not use often. We hypothesized that recalling the correct use of APIs is where Copilot can provide additional value because it combines code completion (code that developers would likely be able to write themselves) with information that developers might otherwise need to search for (for example, on Stack Overflow). Additional details can be found in the "Study Protocol" sidebar.
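To illustrate that hypothesis, the following sketch shows the kind of standard-library boilerplate (Python's smtplib) that the "Send email" task was designed to elicit; the host, port, and credentials are placeholders, not part of the study materials:

```python
import smtplib
from email.message import EmailMessage

def send_email(sender: str, recipient: str, subject: str, body: str) -> None:
    """Send a plain-text email, the kind of API recall the task targeted."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = subject
    msg.set_content(body)

    # placeholder SMTP host and credential; a real caller supplies their own
    with smtplib.SMTP("smtp.example.com", 587) as server:
        server.starttls()  # upgrade the connection to TLS
        server.login(sender, "app-password")
        server.send_message(msg)
```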

At the completion of the case study, researchers convened to discuss participant responses and the common themes that emerged. These themes included enjoyment, guilt, skepticism, growing clarity on how to interact with the tool, and opportunities for AI pair programming to evolve.

In terms of enjoyment, many participants expressed overall amazement during their first interactions with the tool: "Wow. That saved me a heck of a lot of time. Yeah, I think I would've tried to do it all in one line and throw some line breaks in there. But this looks better and it's actually a bit more readable."

Following this first interaction, the light guilt of having this tool at their disposal set in for some participants. One described it as: "This is pretty cool. At the same time, [Copilot can] make you a little lazy when thinking about the logic you want to implement. You'll just focus on that logic, which it's providing to you, instead of thinking of your own [logic]."

A core highlight from this case study was the early indicator that developers using AI-assisted tools often spend more time reading and reviewing code than writing it, a finding supported by other investigations.4 The insights from study participants also highlighted some promising areas to investigate in the future, such as including more context around the suggestions presented (for example, which suggestion optimizes for readability, performance, or conciseness), as well as code provenance.

Finally, all the participants remarked that they felt Copilot contributed to their productivity. This led to further investigation and the final study covered in this article.


How Does Copilot Impact Productivity?

To better understand how Copilot usage can impact productivity directly, we conducted a survey with users about their perceived productivity and compared it with directly measurable usage data. (For a full description and findings of this study, see Ziegler et al.7)

Survey participants were those users in the technical preview who opted in to receive communications. Of the 17,420 surveys sent, 2,047 responses were received between February 10, 2022, and March 12, 2022. The survey included questions along several dimensions of productivity based on the SPACE framework.2

Analysis of their responses showed that users' perceived productivity correlated with 11 usage metrics, including persistence rate (percentage of accepted completions that remained unchanged after various intervals); contribution speed (number of characters in accepted completions per hour); acceptance rate (percentage of shown completions accepted by the user); and volume (total number of completions shown to the user). The paper includes a full list of the adapted metrics used. Across all analyses, acceptance rate has the highest positive correlation with aggregate perceived productivity (ρ = 0.24, P < 0.0001)—higher than persistence measures.7 This finding implies that users perceive themselves to be more productive when accepting suggestions from Copilot, even if they must edit the suggestion. That said, this presents opportunities to explore how suggestions that help users think and tinker may be more valuable to them than ones that save them typing time.6
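To make these metric definitions concrete, here is a minimal sketch of how the first few could be computed from logged suggestion events. The event schema is our simplification for illustration; it is not the instrumentation used in the study (see Ziegler et al.7 for the full definitions):

```python
from dataclasses import dataclass

@dataclass
class SuggestionEvent:
    shown: bool                # completion was displayed to the user
    accepted: bool             # user accepted the completion
    chars: int                 # characters in the completion
    unchanged_after_5m: bool   # persistence check at one interval

def acceptance_rate(events):
    shown = [e for e in events if e.shown]
    return sum(e.accepted for e in shown) / len(shown) if shown else 0.0

def persistence_rate(events):
    accepted = [e for e in events if e.accepted]
    return sum(e.unchanged_after_5m for e in accepted) / len(accepted) if accepted else 0.0

def contribution_speed(events, hours: float):
    return sum(e.chars for e in events if e.accepted) / hours
```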


Discussion

Copilot is one of the first widely used developer tools powered by AI models, marking a notable shift from earlier program-synthesis approaches such as genetic programming.3 Over the next five years, however, AI-powered tools will likely be helping developers with many diverse tasks. For example, such models may be used to improve code review, directing reviewers to the parts of a change where review is most needed or even directly providing feedback on changes. Models such as Codex may suggest fixes for defects in code, build failures, or failing tests. These models can write tests automatically, helping to improve code quality and the downstream reliability of distributed systems.

AI models could help refactor large, complex code bases, making them easier to read and maintain. Code comments or documentation may be automatically generated using these models. In short, any developer task that requires interacting with or reasoning about code can likely be aided by AI. The challenge will come in creating the right user experience such that the developer is helped more than hindered.

This study of Copilot shows that developers spend more time reviewing code (suggested by Copilot or similar tools) than writing code. As AI-powered tools are integrated into more software development tasks, developer roles will shift so that more time is spent assessing suggestions related to the task than doing the task itself (for example, instead of fixing a bug directly, a developer will assess recommendations of bug fixes).

In the context of Copilot, there is a shift from writing code to understanding code. An underlying assumption is that this way of working—looking over recommendations from tools—is more efficient than doing the task directly without help. These initial user studies indicate that this is true, but this assumption may not always hold for varying contexts and tasks. Finding ways to help developers understand and assess code—and the context within which that code executes and interacts—will be important. This naturally leads to the need to understand the dynamics between developers and AI-powered tools.

An increasingly important topic of consideration is the trust that developers have in AI-powered tools like Copilot. The success of any new development tool is tied to how much a developer trusts that the tool is doing the "right thing." What factors do developers find important in building that trust? How does trust get rebuilt after an AI-powered developer tool behaves in an unexpected manner? For example, if code suggested by Copilot introduces security vulnerabilities or performance bottlenecks, its use will rapidly decline.

"Traditional" tools such as compilers or version control systems are largely deterministic and predictable. When problems occur, developers can examine and even modify the source code to understand any unexpected behavior. That is simply not the case with AI-powered tools. In fact, AI deep-learning models are probabilistic and more opaque. Further research is needed to better understand how tools leveraging these AI models can be designed to foster developer trust, leading to a measurable positive impact with developers.

Finally, it will become important to track AI-generated code throughout the software development life cycle, as this will help answer important questions: Does AI-generated code lead to fewer (or more) build breaks, test failures, or even post-release defects? Should such code receive extra scrutiny during code review? What proportion of shipping code comes from tools such as Copilot?

The answers to these questions are important to all stakeholders of a software organization, but answering them requires knowing where each line of code comes from. Unfortunately, these answers are unknown right now: The provenance of generated code doesn't live past a single development session in an IDE. There is no way to know which code checked into a git repository came from an AI tool. Developing provenance tools that can track generated code end to end, from IDE to deployment, will be critical for software organizations to make informed decisions about when, where, and how to incorporate AI-powered tools into their development.
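One purely speculative shape for such a provenance record follows; the sidecar log, field names, and any git integration are assumptions of ours, not an existing tool:

```python
import hashlib
import json
import time

def record_provenance(file_path: str, suggestion_text: str, model: str) -> dict:
    """Log an accepted AI suggestion so later tooling can trace its origin."""
    record = {
        "file": file_path,
        "content_hash": hashlib.sha256(suggestion_text.encode()).hexdigest(),
        "model": model,              # which tool or model produced the code
        "accepted_at": time.time(),  # when the developer accepted it
    }
    # A sidecar log keeps records outside the source file; surviving past the
    # IDE session into version control would require, say, git notes or trailers.
    with open(".ai_provenance.jsonl", "a") as log:
        log.write(json.dumps(record) + "\n")
    return record
```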


Conclusion

This research on Copilot provided early insight into how AI-powered tools are making an entrance into the software-development process. These studies have also raised new research questions that warrant further investigation of AI-powered tools overall.1 We hope these early findings inspire readers to consider what this can mean for the nature of collaboration in their own work and its potential impact. The sidebar "Research Pointers Hot Off the Press" lists some recent work on AI-powered programming.

Acknowledgments. We thank the Copilot team for great discussions and our study participants for offering great insights.


References

1. Ernst, N.A., Bavota, G. AI-driven development is here: Should you worry? IEEE Softw. 39, 2 (2022), 106–110; https://ieeexplore.ieee.org/document/9713901.

2. Forsgren, N., Storey, M.A., Maddila, C., Zimmermann, T., Houck, B., Butler, J. The SPACE of developer productivity: There's more to it than you think. ACM Queue 19, 1 (2021), 20–48; https://queue.acm.org/detail.cfm?id=3454124.

3. Sobania, D., Briesch, M., Rothlauf, F. Choose your programming copilot: A comparison of the program synthesis performance of GitHub Copilot and genetic programming. In Proceedings of the 2022 Genetic and Evolutionary Computation Conf., 1019–1027; https://dl.acm.org/doi/10.1145/3512290.3528700.

4. Vaithilingam, P., Zhang, T., Glassman, E. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In Proceedings of the 2022 Conf. Human Factors in Computing Systems Ext. Abstracts, 1–7; https://doi.org/10.1145/3491101.3519665.

5. Williams, L. Pair programming. Making Software: What Really Works, and Why We Believe It. A. Oram and G. Wilson, eds. O'Reilly Media, 2011, 311–328.

6. Ziegler, A. Research: How GitHub Copilot helps improve developer productivity. GitHub Blog, 2022; http://bit.ly/3I9tUye.

7. Ziegler, A. et al. Productivity assessment of neural code completion. In Proceedings of the 6th ACM SIGPLAN Intern. Symp. Machine Programming 2022, 21–29; https://doi.org/10.1145/3520312.3534864.


Authors

Christian Bird is a senior principal researcher in the Software Analysis and Intelligence (SAINTES) group at Microsoft Research. His work has focused on code review, branching and merging, developer productivity, and applying AI/ML to software engineering tasks.

Denae Ford is a senior researcher at Microsoft Research in the SAINTES group and an affiliate assistant professor in the Human Centered Design and Engineering Department at the University of Washington, Seattle, WA, USA.

Thomas Zimmermann is a senior principal researcher in the Software Analysis and Intelligence (SAINTES) group at Microsoft Research. He is best known for his work on software analytics and data science in software engineering.

Nicole Forsgren is a partner at Microsoft Research, where she leads Developer Velocity Lab. Her current work investigates AI and its role in transforming the software engineering process.

Eirini Kalliamvakou is a staff researcher at GitHub Next, where she leads user studies that shape the team's understanding of user needs and guide the iteration of prototypes.

Travis Lowdermilk is a principal UX researcher in the Developer Division at Microsoft. His work focuses on connecting product teams with their customers to uncover unmet needs and build innovative products.

Idan Gazit is a senior director of research at GitHub Next, where he leads the Developer Experiences research team. Prior to that, he led the Heroku Data UX team, and is an alumnus of the Django Web Framework core team.



Copyright held by owner/author. Publication rights licensed to ACM.
Request permission to publish from permissions@acm.org



 
