CHAPTER 7
Measuring Engineering Productivity

Written by Ciera Jaspen
Edited by Riona Macnamara

Google is a data-driven company. We back up most of our products and design decisions with hard data. The culture of data-driven decision making, using appropriate metrics, has some drawbacks, but overall, relying on data tends to make most decisions objective rather than subjective, which is often a good thing. Collecting and analyzing data on the human side of things, however, has its own challenges. Specifically, within software engineering, Google has found it very valuable to have a team of specialists focus on engineering productivity itself as the company scales, so that it can leverage insights from such a team.

Why Should We Measure Engineering Productivity?

Let's presume that you have a thriving business (e.g., you run an online search engine), and you want to increase your business's scope (enter into the enterprise application market, or the cloud market, or the mobile market). Presumably, to increase the scope of your business, you'll also need to increase the size of your engineering organization. However, as organizations grow in size linearly, communication costs grow quadratically.[1] Adding more people will be necessary to increase the scope of your business, but the communication overhead costs will not scale linearly as you add additional personnel. As a result, you won't be able to scale the scope of your business linearly with the size of your engineering organization.

[1] Frederick P. Brooks, The Mythical Man-Month: Essays on Software Engineering (New York: Addison-Wesley, 1995).
There is another way to address our scaling problem, though: we could make each individual more productive. If we can increase the productivity of individual engineers in the organization, we can increase the scope of our business without the commensurate increase in communication overhead.

Google has had to grow quickly into new businesses, which has meant learning how to make our engineers more productive. To do this, we needed to understand what makes them productive, identify inefficiencies in our engineering processes, and fix the identified problems. Then, we would repeat the cycle as needed in a continuous improvement loop. By doing this, we would be able to scale our engineering organization with the increased demand on it.

However, this improvement cycle also takes human resources. It would not be worthwhile to improve the productivity of your engineering organization by the equivalent of 10 engineers per year if it took 50 engineers per year to understand and fix productivity blockers. Therefore, our goal is not only to improve software engineering productivity, but to do so efficiently.

At Google, we addressed these trade-offs by creating a team of researchers dedicated to understanding engineering productivity. Our research team includes people from the software engineering research field and generalist software engineers, but we also include social scientists from a variety of fields, including cognitive psychology and behavioral economics. The addition of people from the social sciences allows us to not only study the software artifacts that engineers produce, but also to understand the human side of software development, including personal motivations, incentive structures, and strategies for managing complex tasks. The goal of the team is to take a data-driven approach to measuring and improving engineering productivity.

In this chapter, we walk through how our research team achieves this goal. This begins with the triage process: there are many parts of software development that we can measure, but what should we measure? After a project is selected, we walk through how the research team identifies meaningful metrics that will pinpoint the problematic parts of the process. Finally, we look at how Google uses these metrics to track improvements to productivity.

For this chapter, we follow one concrete example posed by the C++ and Java language teams at Google: readability. For most of Google's existence, these teams have managed the readability process at Google. (For more on readability, see Chapter 3.) The readability process was put in place in the early days of Google, before automatic formatters (Chapter 8) and linters that block submission (Chapter 9) were commonplace. The process itself is expensive to run because it requires hundreds of engineers to perform readability reviews for other engineers in order to grant them readability. Some engineers viewed it as an archaic hazing process that no longer held utility, and it was a favorite topic to argue about around the lunch table.
The concrete question from the language teams was this: is the time spent on the readability process worthwhile?

Triage: Is It Even Worth Measuring?

Before we decide how to measure the productivity of engineers, we need to know when a metric is even worth measuring. The measurement itself is expensive: it takes people to measure the process, analyze the results, and disseminate them to the rest of the company. Furthermore, the measurement process itself might be onerous and slow down the rest of the engineering organization. Even if it is not slow, tracking progress might change engineers' behavior, possibly in ways that mask the underlying issues. We need to measure and estimate smartly; although we don't want to guess, we shouldn't waste time and resources measuring unnecessarily.

At Google, we've come up with a series of questions to help teams determine whether it's even worth measuring productivity in the first place. We first ask people to describe what they want to measure in the form of a concrete question; we find that the more concrete people can make this question, the more likely they are to derive benefit from the process. When the readability team approached us, its question was simple: are the costs of an engineer going through the readability process worth the benefits they might be deriving for the company?

We then ask them to consider the following aspects of their question:

What result are you expecting, and why?
Even though we might like to pretend that we are neutral investigators, we are not. We do have preconceived notions about what ought to happen. By acknowledging this at the outset, we can try to address these biases and prevent post hoc explanations of the results.

When this question was posed to the readability team, it noted that it was not sure. People were certain the costs had been worth the benefits at one point in time, but with the advent of autoformatters and static analysis tools, no one was entirely certain. There was a growing belief that the process now served as a hazing ritual. Although it might still provide engineers with benefits (and they had survey data showing that people did claim these benefits), it was not clear whether it was worth the time commitment of the authors or the reviewers of the code.

If the data supports your expected result, what action will be taken?
We ask this because if no action will be taken, there is no point in measuring. Notice that the action might in fact be "maintain the status quo," canceling a change that was otherwise planned to occur without this result.
When asked about this, the answer from the readability team was straightforward: if the benefit was enough to justify the costs of the process, they would link to the research and the data on the readability FAQ and advertise it to set expectations.

If we get a negative result, will appropriate action be taken?
We ask this question because in many cases, we find that a negative result will not change a decision. There might be other inputs into a decision that would override any negative result. If that is the case, it might not be worth measuring in the first place. This is the question that stops most of the projects that our research team takes on; we learn that the decision makers were interested in knowing the results, but for other reasons, they will not choose to change course.

In the case of readability, however, we had a strong statement of action from the team. It committed that, if our analysis showed that the costs either outweighed the benefits or the benefits were negligible, the team would kill the process. Because different programming languages have different levels of maturity in formatters and static analyses, this evaluation would happen on a per-language basis.

Who is going to decide to take action on the result, and when would they do it?
We ask this to ensure that the person requesting the measurement is the one who is empowered to take action (or is doing so directly on their behalf). Ultimately, the goal of measuring our software process is to help people make business decisions. It's important to understand who that individual is, including what form of data convinces them. Although the best research includes a variety of approaches (everything from structured interviews to statistical analyses of logs), there might be limited time in which to provide decision makers with the data they need. In those cases, it might be best to cater to the decision maker. Do they tend to make decisions by empathizing through the stories that can be retrieved from interviews?[2] Do they trust survey results or logs data? Do they feel comfortable with complex statistical analyses? If the decider doesn't believe the form of the result in principle, there is again no point in measuring the process.

In the case of readability, we had a clear decision maker for each programming language. Two language teams, Java and C++, actively reached out to us for assistance, and the others were waiting to see what happened with those languages first.[3]

[2] It's worth pointing out here that our industry currently disparages "anecdata," and everyone has a goal of being "data driven." Yet anecdotes continue to exist because they are powerful. An anecdote can provide context and narrative that raw numbers cannot; it can provide a deep explanation that resonates with others because it mirrors personal experience. Although our researchers do not make decisions based on anecdotes, we do use and encourage techniques such as structured interviews and case studies to deeply understand phenomena and provide context to quantitative data.
The decision makers trusted engineers' self-reported experiences for understanding happiness and learning, but they wanted to see "hard numbers" based on logs data for velocity and code quality. This meant that we needed to include both qualitative and quantitative analysis for these metrics. There was not a hard deadline for this work, but there was an internal conference that would make for a useful time for an announcement if there was going to be a change. That deadline gave us several months in which to complete the work.

By asking these questions, we find that in many cases, measurement is simply not worthwhile…and that's OK! There are many good reasons not to measure the impact of a tool or process on productivity. Here are some examples that we've seen:

You can't afford to change the process/tools right now
There might be time constraints or financial constraints that prevent this. For example, you might determine that if only you switched to a faster build tool, it would save hours of time every week. However, the switchover will mean pausing development while everyone converts over, and there's a major funding deadline approaching such that you cannot afford the interruption. Engineering trade-offs are not evaluated in a vacuum—in a case like this, it's important to realize that the broader context completely justifies delaying action on a result.

Any results will soon be invalidated by other factors
Examples here might include measuring the software process of an organization just before a planned reorganization, or measuring the amount of technical debt in a deprecated system.

The decision maker has strong opinions, and you are unlikely to be able to provide a large enough body of evidence, of the right type, to change their beliefs
This comes down to knowing your audience. Even at Google, we sometimes find people who have unwavering beliefs on a topic due to their past experiences. We have found stakeholders who never trust survey data because they do not believe self-reports. We've also found stakeholders who are swayed best by a compelling narrative informed by a small number of interviews. And, of course, there are stakeholders who are swayed only by logs analysis. In all cases, we attempt to triangulate on the truth using mixed methods, but if a stakeholder believes only in methods that are not appropriate for the problem, there is no point in doing the work.

[3] Java and C++ have the greatest amount of tooling support. Both have mature formatters and static analysis tools that catch common mistakes. Both are also heavily funded internally. Even though other language teams, like Python, were interested in the results, clearly there was not going to be a benefit for Python to remove readability if we couldn't even show the same benefit for Java or C++.
The results will be used only as vanity metrics to support something you were going to do anyway
This is perhaps the most common reason we tell people at Google not to measure a software process. Many times, people have planned a decision for multiple reasons, and improving the software development process is only one benefit of several. For example, the release tool team at Google once requested a measurement of a planned change to the release workflow system. Due to the nature of the change, it was obvious that the change would not be worse than the current state, but they didn't know whether it was a minor improvement or a large one. We asked the team: if it turns out to be only a minor improvement, would you spend the resources to implement the feature anyway, even if it didn't look to be worth the investment? The answer was yes! The feature happened to improve productivity, but this was a side effect: it was also more performant and lowered the release tool team's maintenance burden.

The only metrics available are not precise enough to measure the problem and can be confounded by other factors
In some cases, the metrics needed (see the upcoming section on how to identify metrics) are simply unavailable. In these cases, it can be tempting to measure using other metrics that are less precise (lines of code written, for example). However, any results from these metrics will be uninterpretable. If the metric confirms the stakeholders' preexisting beliefs, they might end up proceeding with their plan without considering that the metric is not an accurate measure. If it does not confirm their beliefs, the imprecision of the metric itself provides an easy explanation, and the stakeholder might, again, proceed with their plan.

When you are successful at measuring your software process, you aren't setting out to prove a hypothesis correct or incorrect; success means giving a stakeholder the data they need to make a decision. If that stakeholder won't use the data, the project is always a failure. We should only measure a software process when a concrete decision will be made based on the outcome. For the readability team, there was a clear decision to be made. If the metrics showed the process to be beneficial, they would publicize the result. If not, the process would be abolished. Most important, the readability team had the authority to make this decision.
Selecting Meaningful Metrics with Goals and Signals

After we decide to measure a software process, we need to determine what metrics to use. Clearly, lines of code (LOC) won't do,[4] but how do we actually measure engineering productivity? At Google, we use the Goals/Signals/Metrics (GSM) framework to guide metrics creation.

• A goal is a desired end result. It's phrased in terms of what you want to understand at a high level and should not contain references to specific ways to measure it.
• A signal is how you might know that you've achieved the end result. Signals are things we would like to measure, but they might not be measurable themselves.
• A metric is a proxy for a signal. It is the thing we actually can measure. It might not be the ideal measurement, but it is something that we believe is close enough.

The GSM framework encourages several desirable properties when creating metrics. First, by creating goals first, then signals, and finally metrics, it prevents the streetlight effect. The term comes from the full phrase "looking for your keys under the streetlight": if you look only where you can see, you might not be looking in the right place. With metrics, this occurs when we use the metrics that we have easily accessible and that are easy to measure, regardless of whether those metrics suit our needs. Instead, GSM forces us to think about which metrics will actually help us achieve our goals, rather than simply what we have readily available.

Second, GSM helps prevent both metrics creep and metrics bias by encouraging us to come up with the appropriate set of metrics, using a principled approach, in advance of actually measuring the result. Consider the case in which we select metrics without a principled approach and then the results do not meet our stakeholders' expectations. At that point, we run the risk that stakeholders will propose that we use different metrics that they believe will produce the desired result. And because we didn't select based on a principled approach at the start, there's no reason to say that they're wrong! Instead, GSM encourages us to select metrics based on their ability to measure the original goals. Stakeholders can easily see that these metrics map to their original goals and agree, in advance, that this is the best set of metrics for measuring the outcomes.

[4] "From there it is only a small step to measuring 'programmer productivity' in terms of 'number of lines of code produced per month.' This is a very costly measuring unit because it encourages the writing of insipid code, but today I am less interested in how foolish a unit it is from even a pure business point of view. My point today is that, if we wish to count lines of code, we should not regard them as 'lines produced' but as 'lines spent': the current conventional wisdom is so foolish as to book that count on the wrong side of the ledger." Edsger Dijkstra, "On the Cruelty of Really Teaching Computing Science," EWD Manuscript 1036.
Finally, GSM can show us where we have measurement coverage and where we do not. When we run through the GSM process, we list all our goals and create signals for each one. As we will see in the examples, not all signals are going to be measurable—and that's OK! With GSM, at least we have identified what is not measurable. By identifying these missing metrics, we can assess whether it is worth creating new metrics or even worth measuring at all.

The important thing is to maintain traceability. For each metric, we should be able to trace back to the signal that it is meant to be a proxy for and to the goal it is trying to measure. This ensures that we know which metrics we are measuring and why we are measuring them.

Goals

A goal should be written in terms of a desired property, without reference to any metric. By themselves, these goals are not measurable, but a good set of goals is something that everyone can agree on before proceeding onto signals and then metrics.

To make this work, we need to have identified the correct set of goals to measure in the first place. This would seem straightforward: surely the team knows the goals of their work! However, our research team has found that in many cases, people forget to include all the possible trade-offs within productivity, which could lead to mismeasurement.

Taking the readability example, let's assume that the team was so focused on making the readability process fast and easy that it had forgotten the goal about code quality. The team set up tracking measurements for how long it takes to get through the review process and how happy engineers are with the process. One of our teammates proposes the following:

    I can make your review velocity very fast: just remove code reviews entirely.

Although this is obviously an extreme example, teams forget core trade-offs all the time when measuring: they become so focused on improving velocity that they forget to measure quality (or vice versa). To combat this, our research team divides productivity into five core components. These five components are in trade-off with one another, and we encourage teams to consider goals in each of these components to ensure that they are not inadvertently improving one while driving others downward. To help people remember all five components, we use the mnemonic "QUANTS":
Quality of the code
What is the quality of the code produced? Are the test cases good enough to prevent regressions? How good is an architecture at mitigating risk and changes?

Attention from engineers
How frequently do engineers reach a state of flow? How much are they distracted by notifications? Does a tool encourage engineers to context switch?

Intellectual complexity
How much cognitive load is required to complete a task? What is the inherent complexity of the problem being solved? Do engineers need to deal with unnecessary complexity?

Tempo and velocity
How quickly can engineers accomplish their tasks? How fast can they push their releases out? How many tasks do they complete in a given timeframe?

Satisfaction
How happy are engineers with their tools? How well does a tool meet engineers' needs? How satisfied are they with their work and their end product? Are engineers feeling burned out?

Going back to the readability example, our research team worked with the readability team to identify several productivity goals of the readability process:

Quality of the code
Engineers write higher-quality code as a result of the readability process; they write more consistent code as a result of the readability process; and they contribute to a culture of code health as a result of the readability process.

Attention from engineers
We did not have any attention goal for readability. This is OK! Not all questions about engineering productivity involve trade-offs in all five areas.

Intellectual complexity
Engineers learn about the Google codebase and best coding practices as a result of the readability process, and they receive mentoring during the readability process.

Tempo and velocity
Engineers complete work tasks faster and more efficiently as a result of the readability process.

Satisfaction
Engineers see the benefit of the readability process and have positive feelings about participating in it.
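To make the goal → signal → metric traceability concrete, here is a minimal sketch in Python (not a real Google tool; the class and variable names are invented for illustration) of how a team might record a GSM hierarchy so that every metric can be traced back to its signal and goal:

from dataclasses import dataclass

@dataclass
class Goal:
    """A desired end result, phrased without reference to measurement."""
    description: str

@dataclass
class Signal:
    """Evidence that a goal has been achieved; may not be directly measurable."""
    description: str
    goal: Goal

@dataclass
class Metric:
    """A measurable proxy for a signal."""
    description: str
    signal: Signal

# Example drawn from the readability goals above.
quality_goal = Goal("Engineers write higher-quality code as a result "
                    "of the readability process.")
quality_signal = Signal("Engineers who have been granted readability judge "
                        "their code to be of higher quality.", quality_goal)
quality_metric = Metric("Quarterly survey: proportion of engineers satisfied "
                        "with the quality of their own code.", quality_signal)

# Traceability: every metric leads back to the goal it is meant to measure.
print(quality_metric.signal.goal.description)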
Signals

A signal is the way in which we will know we've achieved our goal. Not all signals are measurable, but that's acceptable at this stage. There is not a 1:1 relationship between signals and goals. Every goal should have at least one signal, but it might have more. Some goals might also share a signal. Table 7-1 shows some example signals for the goals of the readability process measurement.

Table 7-1. Signals and goals

Goal: Engineers write higher-quality code as a result of the readability process.
Signals:
• Engineers who have been granted readability judge their code to be of higher quality than engineers who have not been granted readability.
• The readability process has a positive impact on code quality.

Goal: Engineers learn about the Google codebase and best coding practices as a result of the readability process. Engineers receive mentoring during the readability process.
Signals:
• Engineers report learning from the readability process.
• Engineers report positive interactions with experienced Google engineers who serve as reviewers during the readability process.

Goal: Engineers complete work tasks faster and more efficiently as a result of the readability process.
Signals:
• Engineers who have been granted readability judge themselves to be more productive than engineers who have not been granted readability.
• Changes written by engineers who have been granted readability are faster to review than changes written by engineers who have not been granted readability.

Goal: Engineers see the benefit of the readability process and have positive feelings about participating in it.
Signal:
• Engineers view the readability process as being worthwhile.

Metrics

Metrics are where we finally determine how we will measure the signal. Metrics are not the signal themselves; they are the measurable proxy of the signal. Because they are a proxy, they might not be a perfect measurement. For this reason, some signals might have multiple metrics as we try to triangulate on the underlying signal.

For example, to measure whether engineers' code is reviewed faster after readability, we might use a combination of both survey data and logs data. Neither of these metrics really provides the underlying truth. (Human perceptions are fallible, and logs metrics might not be measuring the entire picture of the time an engineer spends reviewing a piece of code, or they can be confounded by factors unknown at the time, like the size or difficulty of a code change.) However, if these metrics show different results, it signals that possibly one of them is incorrect and we need to explore further. If they are the same, we have more confidence that we have reached some kind of truth.
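As a sketch of how such triangulation might be checked in practice (the function, tolerance, and numbers here are invented for illustration; this is not a real Google tool), two proxies for the same signal can be compared and the signal flagged for deeper investigation when they disagree:

def triangulate(survey_delta, logs_delta, tolerance=0.1):
    """Compare two proxies for the same signal.

    Both deltas are relative changes (e.g., 0.15 means that method
    shows a 15% improvement).
    """
    if abs(survey_delta - logs_delta) <= tolerance:
        return "metrics agree; more confidence in the signal"
    return "metrics disagree; one proxy may be wrong, explore further"

# Survey says reviews feel 15% faster; logs show 12% faster: agreement.
print(triangulate(survey_delta=0.15, logs_delta=0.12))
# Survey says 20% faster, but logs show 2% slower: investigate.
print(triangulate(survey_delta=0.20, logs_delta=-0.02))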
Additionally, some signals might not have any associated metric because the signal might simply be unmeasurable at this time. Consider, for example, measuring code quality. Although academic literature has proposed many proxies for code quality, none of them have truly captured it. For readability, we had to decide between using a poor proxy and possibly making a decision based on it, or simply acknowledging that this is a point that cannot currently be measured. Ultimately, we decided not to capture code quality as a quantitative measure, though we did ask engineers to self-rate their code quality.

Following the GSM framework is a great way to clarify the goals for why you are measuring your software process and how it will actually be measured. However, it's still possible that the metrics selected are not telling the complete story because they are not capturing the desired signal. At Google, we use qualitative data to validate our metrics and ensure that they are capturing the intended signal.

Using Data to Validate Metrics

As an example, we once created a metric for measuring each engineer's median build latency; the goal was to capture the "typical experience" of engineers' build latencies. We then ran an experience sampling study. In this style of study, engineers are interrupted in the context of doing a task of interest to answer a few questions. After an engineer started a build, we automatically sent them a small survey about their experiences and expectations of build latency. However, in a few cases, the engineers responded that they had not started a build! It turned out that automated tools were starting up builds, but the engineers were not blocked on these results, so the builds didn't "count" toward their "typical experience." We then adjusted the metric to exclude such builds.[5]

Quantitative metrics are useful because they give you power and scale. You can measure the experience of engineers across the entire company over a long period of time and have confidence in the results. However, they don't provide any context or narrative. Quantitative metrics don't explain why an engineer chose to use an antiquated tool to accomplish their task, or why they took an unusual workflow, or why they circumvented a standard process. Only qualitative studies can provide this information, and only qualitative studies can then provide insight on the next steps to improve a process.

[5] It has routinely been our experience at Google that when the quantitative and qualitative metrics disagree, it was because the quantitative metrics were not capturing the expected result.
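A minimal sketch of the adjusted build-latency metric might look like the following (the record format and numbers are invented for illustration; the real pipeline is not described in this chapter). The key adjustment is filtering out builds the engineer never waited on before taking each engineer's median:

import statistics
from collections import defaultdict

# Hypothetical build records: (engineer, latency_seconds, engineer_initiated)
builds = [
    ("ada", 42.0, True),
    ("ada", 310.0, False),  # started by an automated tool; excluded
    ("ada", 55.0, True),
    ("grace", 120.0, True),
]

def median_build_latency(records):
    """Median build latency per engineer, counting only builds that the
    engineer actually started (and was therefore blocked on)."""
    latencies = defaultdict(list)
    for engineer, latency_secs, engineer_initiated in records:
        if engineer_initiated:
            latencies[engineer].append(latency_secs)
    return {eng: statistics.median(vals) for eng, vals in latencies.items()}

print(median_build_latency(builds))  # {'ada': 48.5, 'grace': 120.0}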
Consider now the signals presented in Table 7-2. What metrics might you create to measure each of those? Some of these signals might be measurable by analyzing tool and code logs. Others are measurable only by directly asking engineers. Still others might not be perfectly measurable—how do we truly measure code quality, for example?

Ultimately, when evaluating the impact of readability on productivity, we ended up with a combination of metrics from three sources. First, we had a survey that was specifically about the readability process. This survey was given to people after they completed the process; this allowed us to get their immediate feedback about the process. This hopefully avoids recall bias,[6] but it does introduce both recency bias[7] and sampling bias.[8] Second, we used a large-scale quarterly survey to track items that were not specifically about readability; instead, they were purely about metrics that we expected readability should affect. Finally, we used fine-grained logs metrics from our developer tools to determine how much time the logs claimed it took engineers to complete specific tasks.[9] Table 7-2 presents the complete list of metrics with their corresponding signals and goals.

[6] Recall bias is the bias from memory. People are more likely to recall events that are particularly interesting or frustrating.
[7] Recency bias is another form of bias from memory in which people are biased toward their most recent experience. In this case, as they just successfully completed the process, they might be feeling particularly good about it.
[8] Because we asked only those people who completed the process, we aren't capturing the opinions of those who did not complete the process.
[9] There is a temptation to use such metrics to evaluate individual engineers, or perhaps even to identify high and low performers. Doing so would be counterproductive, though. If productivity metrics are used for performance reviews, engineers will be quick to game the metrics, and they will no longer be useful for measuring and improving productivity across the organization. The only way to make these measurements work is to let go of the idea of measuring individuals and embrace measuring the aggregate effect.

Table 7-2. Goals, signals, and metrics

Quality of the code

Goal: Engineers write higher-quality code as a result of the readability process.
Signal: Engineers who have been granted readability judge their code to be of higher quality than engineers who have not been granted readability.
Metric: Quarterly Survey: Proportion of engineers who report being satisfied with the quality of their own code.
Signal: The readability process has a positive impact on code quality.
Metrics: Readability Survey: Proportion of engineers reporting that readability reviews have no impact or negative impact on code quality. Readability Survey: Proportion of engineers reporting that participating in the readability process has improved code quality for their team.

Goal: Engineers write more consistent code as a result of the readability process.
Signal: Engineers are given consistent feedback and direction in code reviews by readability reviewers as a part of the readability process.
Metric: Readability Survey: Proportion of engineers reporting inconsistency in readability reviewers' comments and readability criteria.

Goal: Engineers contribute to a culture of code health as a result of the readability process.
Signal: Engineers who have been granted readability regularly comment on style and/or readability issues in code reviews.
Metric: Readability Survey: Proportion of engineers reporting that they regularly comment on style and/or readability issues in code reviews.

Attention from engineers

n/a

Intellectual complexity

Goal: Engineers learn about the Google codebase and best coding practices as a result of the readability process.
Signal: Engineers report learning from the readability process.
Metrics: Readability Survey: Proportion of engineers reporting that they learned about four relevant topics. Readability Survey: Proportion of engineers reporting that learning or gaining expertise was a strength of the readability process.

Goal: Engineers receive mentoring during the readability process.
Signal: Engineers report positive interactions with experienced Google engineers who serve as readability reviewers during the readability process.
Metric: Readability Survey: Proportion of engineers reporting that working with readability reviewers was a strength of the readability process.

Tempo/velocity

Goal: Engineers are more productive as a result of the readability process.
Signal: Engineers who have been granted readability judge themselves to be more productive than engineers who have not been granted readability.
Metric: Quarterly Survey: Proportion of engineers reporting that they're highly productive.
Signal: Engineers report that completing the readability process positively affects their engineering velocity.
Metric: Readability Survey: Proportion of engineers reporting that not having readability reduces team engineering velocity.
Signal: Changelists (CLs) written by engineers who have been granted readability are faster to review than CLs written by engineers who have not been granted readability.
Metric: Logs data: Median review time for CLs from authors with readability and without readability.
Signal: CLs written by engineers who have been granted readability are easier to shepherd through code review than CLs written by engineers who have not been granted readability.
Metric: Logs data: Median shepherding time for CLs from authors with readability and without readability.
Signal: CLs written by engineers who have been granted readability are faster to get through code review than CLs written by engineers who have not been granted readability.
Metric: Logs data: Median time to submit for CLs from authors with readability and without readability.
Signal: The readability process does not have a negative impact on engineering velocity.
Metrics: Readability Survey: Proportion of engineers reporting that the readability process negatively impacts their velocity. Readability Survey: Proportion of engineers reporting that readability reviewers responded promptly. Readability Survey: Proportion of engineers reporting that timeliness of reviews was a strength of the readability process.

Satisfaction

Goal: Engineers see the benefit of the readability process and have positive feelings about participating in it.
Signal: Engineers view the readability process as being an overall positive experience.
Metric: Readability Survey: Proportion of engineers reporting that their experience with the readability process was positive overall.
Signal: Engineers view the readability process as being worthwhile.
Metrics: Readability Survey: Proportion of engineers reporting that the readability process is worthwhile. Readability Survey: Proportion of engineers reporting that the quality of readability reviews is a strength of the process. Readability Survey: Proportion of engineers reporting that thoroughness is a strength of the process.
Signal: Engineers do not view the readability process as frustrating.
Metrics: Readability Survey: Proportion of engineers reporting that the readability process is uncertain, unclear, slow, or frustrating. Quarterly Survey: Proportion of engineers reporting that they're satisfied with their own engineering velocity.
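The logs-based rows in Table 7-2 all reduce to comparing a median between two populations of authors. A minimal sketch of that comparison (in Python, with an invented record format and made-up numbers; this is not the actual analysis code) might look like the following:

import statistics

# Hypothetical CL records: (author_has_readability, review_time_hours)
cl_records = [
    (True, 2.1), (True, 3.4), (True, 1.8),
    (False, 4.0), (False, 5.5), (False, 3.9),
]

def median_review_time(records, has_readability):
    """Median review time for CLs whose authors do (or do not) have readability."""
    times = [hours for granted, hours in records if granted == has_readability]
    return statistics.median(times)

print("with readability:", median_review_time(cl_records, has_readability=True))
print("without readability:", median_review_time(cl_records, has_readability=False))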
Taking Action and Tracking Results

Recall our original goal in this chapter: we want to take action and improve productivity. After performing research on a topic, the team at Google always prepares a list of recommendations for how we can continue to improve. We might suggest new features for a tool, improving a tool's latency, improving documentation, removing obsolete processes, or even changing the incentive structures for the engineers. Ideally, these recommendations are "tool driven": it does no good to tell engineers to change their process or way of thinking if the tools do not support them in doing so. We instead always assume that engineers will make the appropriate trade-offs if they have the proper data available and the suitable tools at their disposal.

For readability, our study showed that the process was overall worthwhile: engineers who had achieved readability were satisfied with the process and felt they learned from it. Our logs showed that they also had their code reviewed faster and submitted it faster, even accounting for no longer needing as many reviewers. Our study also showed places for improvement with the process: engineers identified pain points that would have made the process faster or more pleasant. The language teams took these recommendations and improved the tooling and process to make it faster and more transparent so that engineers would have a more pleasant experience.

Conclusion

At Google, we've found that staffing a team of engineering productivity specialists has widespread benefits to software engineering; rather than relying on each team to chart its own course to increase productivity, a centralized team can focus on broad-based solutions to complex problems. Such "human-based" factors are notoriously difficult to measure, and it is important for experts to understand the data being analyzed, given that many of the trade-offs involved in changing engineering processes are difficult to measure accurately and often have unintended consequences. Such a team must remain data driven and aim to eliminate subjective bias.

TL;DRs

• Before measuring productivity, ask whether the result is actionable, regardless of whether the result is positive or negative. If you can't do anything with the result, it is likely not worth measuring.
• Select meaningful metrics using the GSM framework. A good metric is a reasonable proxy to the signal you're trying to measure, and it is traceable back to your original goals.
• Select metrics that cover all parts of productivity (QUANTS). By doing this, you ensure that you aren't improving one aspect of productivity (like developer velocity) at the cost of another (like code quality).
• Qualitative metrics are metrics, too! Consider having a survey mechanism for tracking longitudinal metrics about engineers' beliefs. Qualitative metrics should also align with the quantitative metrics; if they do not, it is likely the quantitative metrics that are incorrect.
• Aim to create recommendations that are built into the developer workflow and incentive structures. Even though it is sometimes necessary to recommend additional training or documentation, change is more likely to occur if it is built into the developer's daily habits.
PART III
Processes
CHAPTER 8
Style Guides and Rules

Written by Shaindel Schwartz
Edited by Tom Manshreck

Most engineering organizations have rules governing their codebases—rules about where source files are stored, rules about the formatting of the code, rules about naming and patterns and exceptions and threads. Most software engineers are working within the bounds of a set of policies that control how they operate. At Google, to manage our codebase, we maintain a set of style guides that define our rules.

Rules are laws. They are not just suggestions or recommendations, but strict, mandatory laws. As such, they are universally enforceable—rules may not be disregarded except as approved on a need-to-use basis. In contrast to rules, guidance provides recommendations and best practices. These bits are good to follow, even highly advisable to follow, but unlike rules, they usually have some room for variance.

We collect the rules that we define, the do's and don'ts of writing code that must be followed, in our programming style guides, which are treated as canon. "Style" might be a bit of a misnomer here, implying a collection limited to formatting practices. Our style guides are more than that; they are the full set of conventions that govern our code. That's not to say that our style guides are strictly prescriptive; style guide rules may call for judgment, such as the rule to use names that are "as descriptive as possible, within reason." Rather, our style guides serve as the definitive source for the rules to which our engineers are held accountable.

We maintain separate style guides for each of the programming languages used at Google.[1] At a high level, all of the guides have similar goals, aiming to steer code development with an eye to sustainability.

[1] Many of our style guides have external versions, which you can find at https://google.github.io/styleguide. We cite numerous examples from these guides within this chapter.
At the same time, there is a lot of variation among them in scope, length, and content. Programming languages have different strengths, different features, different priorities, and different historical paths to adoption within Google's ever-evolving repositories of code. It is far more practical, therefore, to independently tailor each language's guidelines. Some of our style guides are concise, focusing on a few overarching principles like naming and formatting, as demonstrated in our Dart, R, and Shell guides. Other style guides include far more detail, delving into specific language features and stretching into far lengthier documents—notably, our C++, Python, and Java guides. Some style guides put a premium on typical non-Google use of the language—our Go style guide is very short, adding just a few rules to a summary directive to adhere to the practices outlined in the externally recognized conventions. Others include rules that fundamentally differ from external norms; our C++ rules disallow use of exceptions, a language feature widely used outside of Google code.

The wide variance among even our own style guides makes it difficult to pin down the precise description of what a style guide should cover. The decisions guiding the development of Google's style guides stem from the need to keep our codebase sustainable. Other organizations' codebases will inherently have different requirements for sustainability that necessitate a different set of tailored rules. This chapter discusses the principles and processes that steer the development of our rules and guidance, pulling examples primarily from Google's C++, Python, and Java style guides.

Why Have Rules?

So why do we have rules? The goal of having rules in place is to encourage "good" behavior and discourage "bad" behavior. The interpretation of "good" and "bad" varies by organization, depending on what the organization cares about. Such designations are not universal preferences; good versus bad is subjective, and tailored to needs. For some organizations, "good" might promote usage patterns that support a small memory footprint or prioritize potential runtime optimizations. In other organizations, "good" might promote choices that exercise new language features. Sometimes, an organization cares most deeply about consistency, so that anything inconsistent with existing patterns is "bad." We must first recognize what a given organization values; we use rules and guidance to encourage and discourage behavior accordingly.

As an organization grows, the established rules and guidelines shape the common vocabulary of coding. A common vocabulary allows engineers to concentrate on what their code needs to say rather than how they're saying it. By shaping this vocabulary, engineers will tend to do the "good" things by default, even subconsciously. Rules thus give us broad leverage to nudge common development patterns in desired directions.
Creating the Rules

When defining a set of rules, the key question is not, "What rules should we have?" The question to ask is, "What goal are we trying to advance?" When we focus on the goal that the rules will be serving, identifying which rules support this goal makes it easier to distill the set of useful rules. At Google, where the style guide serves as law for coding practices, we do not ask, "What goes into the style guide?" but rather, "Why does something go into the style guide?" What does our organization gain by having a set of rules to regulate writing code?

Guiding Principles

Let's put things in context: Google's engineering organization is composed of more than 30,000 engineers. That engineering population exhibits a wild variance in skill and background. About 60,000 submissions are made each day to a codebase of more than two billion lines of code that will likely exist for decades. We're optimizing for a different set of values than most other organizations need, but to some degree, these concerns are ubiquitous—we need to sustain an engineering environment that is resilient to both scale and time.

In this context, the goal of our rules is to manage the complexity of our development environment, keeping the codebase manageable while still allowing engineers to work productively. We are making a trade-off here: the large body of rules that helps us toward this goal does mean we are restricting choice. We lose some flexibility and we might even offend some people, but the gains of consistency and reduced conflict furnished by an authoritative standard win out.

Given this view, we recognize a number of overarching principles that guide the development of our rules, which must:

• Pull their weight
• Optimize for the reader
• Be consistent
• Avoid error-prone and surprising constructs
• Concede to practicalities when necessary
Rules must pull their weight

Not everything should go into a style guide. There is a nonzero cost in asking all of the engineers in an organization to learn and adapt to any new rule that is set. With too many rules,[2] not only does it become harder for engineers to remember all relevant rules as they write their code, but it also becomes harder for new engineers to learn their way. More rules also make it more challenging and more expensive to maintain the rule set.

To this end, we deliberately chose not to include rules expected to be self-evident. Google's style guide is not intended to be interpreted in a lawyerly fashion; just because something isn't explicitly outlawed does not imply that it is legal. For example, the C++ style guide has no rule against the use of goto. C++ programmers already tend to avoid it, so including an explicit rule forbidding it would introduce unnecessary overhead. If just one or two engineers are getting something wrong, adding to everyone's mental load by creating new rules doesn't scale.

Optimize for the reader

Another principle of our rules is to optimize for the reader of the code rather than the author. Given the passage of time, our code will be read far more frequently than it is written. We'd rather the code be tedious to type than difficult to read. In our Python style guide, when discussing conditional expressions, we recognize that they are shorter than if statements and therefore more convenient for code authors. However, because they tend to be more difficult for readers to understand than the more verbose if statements, we restrict their usage. We value "simple to read" over "simple to write." We're making a trade-off here: it can cost more upfront when engineers must repeatedly type potentially longer, descriptive names for variables and types. We choose to pay this cost for the readability it provides for all future readers.

As part of this prioritization, we also require that engineers leave explicit evidence of intended behavior in their code. We want readers to clearly understand what the code is doing as they read it. For example, our Java, JavaScript, and C++ style guides mandate use of the override annotation or keyword whenever a method overrides a superclass method. Without the explicit in-place evidence of design, readers can likely figure out this intent, though it would take a bit more digging on the part of each reader working through the code.

[2] Tooling matters here. The measure for "too many" is not the raw number of rules in play, but how many an engineer needs to remember. For example, in the bad old days pre-clang-format, we needed to remember a ton of formatting rules. Those rules haven't gone away, but with our current tooling, the cost of adherence has fallen dramatically. We've reached a point at which somebody could add an arbitrary number of formatting rules and nobody would care, because the tool just does it for you.
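Returning to the conditional-expression example above, here is a minimal Python illustration of that trade-off (the variable names are invented): the conditional expression is quicker to write, whereas the if statement spells out each branch for the reader.

num_lines = 750

# Conditional expression: compact and convenient for the author.
bucket = "large" if num_lines > 500 else "small"

# Equivalent if statement: more verbose, but each branch is explicit
# for the reader.
if num_lines > 500:
    bucket = "large"
else:
    bucket = "small"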
Evidence of intended behavior becomes even more important when it might be surprising. In C++, it is sometimes difficult to track the ownership of a pointer just by reading a snippet of code. If a pointer is passed to a function, without being familiar with the behavior of the function, we can't be sure what to expect. Does the caller still own the pointer? Did the function take ownership? Can I continue using the pointer after the function returns, or might it have been deleted? To avoid this problem, our C++ style guide prefers the use of std::unique_ptr when ownership transfer is intended. unique_ptr is a construct that manages pointer ownership, ensuring that only one copy of the pointer ever exists. When a function takes a unique_ptr as an argument and intends to take ownership of the pointer, callers must explicitly invoke move semantics:

// Function that takes a Foo* and may or may not assume ownership of
// the passed pointer.
void TakeFoo(Foo* arg);

// Calls to the function don't tell the reader anything about what to
// expect with regard to ownership after the function returns.
Foo* my_foo(NewFoo());
TakeFoo(my_foo);

Compare this to the following:

// Function that takes a std::unique_ptr<Foo>.
void TakeFoo(std::unique_ptr<Foo> arg);

// Any call to the function explicitly shows that ownership is
// yielded and the unique_ptr cannot be used after the function
// returns.
std::unique_ptr<Foo> my_foo(FooFactory());
TakeFoo(std::move(my_foo));

Given the style guide rule, we guarantee that all call sites will include clear evidence of ownership transfer whenever it applies. With this signal in place, readers of the code don't need to understand the behavior of every function call. We provide enough information in the API to reason about its interactions. This clear documentation of behavior at the call sites ensures that code snippets remain readable and understandable. We aim for local reasoning, where the goal is clear understanding of what's happening at the call site without needing to find and reference other code, including the function's implementation.

Most style guide rules covering comments are also designed to support this goal of in-place evidence for readers. Documentation comments (the block comments prepended to a given file, class, or function) describe the design or intent of the code that follows. Implementation comments (the comments interspersed throughout the code itself) justify or highlight non-obvious choices, explain tricky bits, and underscore important parts of the code.
We have style guide rules covering both types of comments, requiring engineers to provide the explanations another engineer might be looking for when reading through the code.

Be consistent

Our view on consistency within our codebase is similar to the philosophy we apply to our Google offices. With a large, distributed engineering population, teams are frequently split among offices, and Googlers often find themselves traveling to other sites. Although each office maintains its unique personality, embracing local flavor and style, for anything necessary to get work done, things are deliberately kept the same. A visiting Googler's badge will work with all local badge readers; any Google device will always get WiFi; the video conferencing setup in any conference room will have the same interface. A Googler doesn't need to spend time learning how to get this all set up; they know that it will be the same no matter where they are. It's easy to move between offices and still get work done.

That's what we strive for with our source code. Consistency is what enables any engineer to jump into an unfamiliar part of the codebase and get to work fairly quickly. A local project can have its unique personality, but its tools are the same, its techniques are the same, its libraries are the same, and it all Just Works.

Advantages of consistency

Even though it might feel restrictive for an office to be disallowed from customizing a badge reader or video conferencing interface, the consistency benefits far outweigh the creative freedom we lose. It's the same with code: being consistent may feel constraining at times, but it means more engineers get more work done with less effort:[3]

• When a codebase is internally consistent in its style and norms, engineers writing code and others reading it can focus on what's getting done rather than how it is presented. To a large degree, this consistency allows for expert chunking.[4] When we solve our problems with the same interfaces and format the code in a consistent way, it's easier for experts to glance at some code, zero in on what's important, and understand what it's doing. It also makes it easier to modularize code and spot duplication. For these reasons, we focus a lot of attention on consistent naming conventions, consistent use of common patterns, and consistent formatting and structure. There are also many rules that put forth a decision on a seemingly small issue solely to guarantee that things are done in only one way.
For example, take the choice of the number of spaces to use for indentation or the limit set on line length.[5] It's the consistency of having one answer rather than the answer itself that is the valuable part here.

• Consistency enables scaling. Tooling is key for an organization to scale, and consistent code makes it easier to build tools that can understand, edit, and generate code. The full benefits of the tools that depend on uniformity can't be applied if everyone has little pockets of code that differ—if a tool can keep source files updated by adding missing imports or removing unused includes, but different projects are choosing different sorting strategies for their import lists, the tool might not be able to work everywhere. When everyone is using the same components and when everyone's code follows the same rules for structure and organization, we can invest in tooling that works everywhere, building in automation for many of our maintenance tasks. If each team needed to separately invest in a bespoke version of the same tool, tailored for its unique environment, we would lose that advantage.

• Consistency helps when scaling the human part of an organization, too. As an organization grows, the number of engineers working on the codebase increases. Keeping the code that everyone is working on as consistent as possible enables better mobility across projects, minimizing the ramp-up time for an engineer switching teams and building in the ability for the organization to flex and adapt as headcount needs fluctuate. A growing organization also means that people in other roles interact with the code—SREs, library engineers, and code janitors, for example. At Google, these roles often span multiple projects, which means engineers unfamiliar with a given team's project might jump in to work on that project's code. A consistent experience across the codebase makes this efficient.

• Consistency also ensures resilience to time. As time passes, engineers leave projects, new people join, ownership shifts, and projects merge or split. Striving for a consistent codebase ensures that these transitions are low cost and allows us nearly unconstrained fluidity for both the code and the engineers working on it, simplifying the processes necessary for long-term maintenance.

[3] Credit to H. Wright for the real-world comparison, made at the point of having visited around 15 different Google offices.
[4] "Chunking" is a cognitive process that groups pieces of information together into meaningful "chunks" rather than keeping note of them individually. Expert chess players, for example, think about configurations of pieces rather than the positions of the individual pieces.
[5] See 4.2 Block indentation: +2 spaces, Spaces vs. Tabs, 4.4 Column limit: 100, and Line Length.
At Scale
A few years ago, our C++ style guide promised to almost never change style guide rules that would make old code inconsistent: "In some cases, there might be good arguments for changing certain style rules, but we nonetheless keep things as they are in order to preserve consistency." When the codebase was smaller and there were fewer old, dusty corners, that made sense.

As the codebase grew bigger and older, that promise stopped being worth prioritizing. This was (for the arbiters behind our C++ style guide, at least) a conscious change: when striking this bit, we were explicitly stating that the C++ codebase would never again be completely consistent, nor were we even aiming for that. It would simply be too much of a burden to not only update the rules to current best practices, but to also require that we apply those rules to everything that's ever been written. Our Large Scale Change tooling and processes allow us to update almost all of our code to follow nearly every new pattern or syntax so that most old code exhibits the most recent approved style (see Chapter 22). Such mechanisms aren't perfect, however; when the codebase gets as large as it is, we can't be sure every bit of old code can conform to the new best practices. Requiring perfect consistency has reached the point where there's too much cost for the value.

Setting the standard. When we advocate for consistency, we tend to focus on internal consistency. Sometimes, local conventions spring up before global ones are adopted, and it isn't reasonable to adjust everything to match. In that case, we advocate a hierarchy of consistency: "Be consistent" starts locally, where the norms within a given file precede those of a given team, which precede those of the larger project, which precede those of the overall codebase. In fact, the style guides contain a number of rules that explicitly defer to local conventions,6 valuing this local consistency over a purely technical choice.

However, it is not always enough for an organization to create and stick to a set of internal conventions. Sometimes, the standards adopted by the external community should be taken into account.

6 Use of const, for example.
Counting Spaces
The Python style guide at Google initially mandated two-space indents for all of our Python code. The standard Python style guide, used by the external Python community, uses four-space indents. Most of our early Python development was in direct support of our C++ projects, not for actual Python applications. We therefore chose to use two-space indentation to be consistent with our C++ code, which was already formatted in that manner. As time went by, we saw that this rationale didn't really hold up. Engineers who write Python code read and write other Python code much more often than they read and write C++ code. We were costing our engineers extra effort every time they needed to look something up or reference external code snippets. We were also going through a lot of pain each time we tried to export pieces of our code into open source, spending time reconciling the differences between our internal code and the external world we wanted to join.

When the time came for Starlark (a Python-based language designed at Google to serve as the build description language) to have its own style guide, we chose to change to using four-space indents to be consistent with the outside world.7

If conventions already exist, it is usually a good idea for an organization to be consistent with the outside world. For small, self-contained, and short-lived efforts, it likely won't make a difference; internal consistency matters more than anything happening outside the project's limited scope. Once the passage of time and potential scaling become factors, the likelihood of your code interacting with outside projects or even ending up in the outside world increases. Looking long-term, adhering to the widely accepted standard will likely pay off.

Avoid error-prone and surprising constructs
Our style guides restrict the use of some of the more surprising, unusual, or tricky constructs in the languages that we use. Complex features often have subtle pitfalls not obvious at first glance. Using these features without thoroughly understanding their complexities makes it easy to misuse them and introduce bugs. Even if a construct is well understood by a project's engineers, future project members and maintainers are not guaranteed to have the same understanding. This reasoning is behind our Python style guide ruling to avoid using power features such as reflection. The reflective Python functions hasattr() and getattr() allow a user to access attributes of objects using strings:

7 Style formatting for BUILD files implemented with Starlark is applied by buildifier. See https://github.com/bazelbuild/buildtools.
    if hasattr(my_object, 'foo'):
      some_var = getattr(my_object, 'foo')

Now, with that example, everything might seem fine. But consider this:

some_file.py:

    A_CONSTANT = [
        'foo',
        'bar',
        'baz',
    ]

other_file.py:

    values = []
    for field in some_file.A_CONSTANT:
      values.append(getattr(my_object, field))

When searching through code, how do you know that the fields foo, bar, and baz are being accessed here? There's no clear evidence left for the reader. You don't easily see and therefore can't easily validate which strings are used to access attributes of your object. What if, instead of reading those values from A_CONSTANT, we read them from a Remote Procedure Call (RPC) request message or from a data store? Such obfuscated code could cause a major security flaw, introduced simply by validating the message incorrectly, and one that would be very difficult to notice. It's also difficult to test and verify such code.

Python's dynamic nature allows such behavior, and in very limited circumstances, using hasattr() and getattr() is valid. In most cases, however, they just cause obfuscation and introduce bugs.

Although these advanced language features might perfectly solve a problem for an expert who knows how to leverage them, power features are often more difficult to understand and are not very widely used. We need all of our engineers able to operate in the codebase, not just the experts. It's not just support for the novice software engineer, but it's also a better environment for SREs: if an SRE is debugging a production outage, they will jump into any bit of suspect code, even code written in a language in which they are not fluent. We place higher value on simplified, straightforward code that is easier to understand and maintain.

Concede to practicalities
In the words of Ralph Waldo Emerson: "A foolish consistency is the hobgoblin of little minds." In our quest for a consistent, simplified codebase, we do not want to blindly ignore all else. We know that some of the rules in our style guides will encounter cases that warrant exceptions, and that's OK. When necessary, we permit concessions to optimizations and practicalities that might otherwise conflict with our rules.
Performance matters. Sometimes, even if it means sacrificing consistency or readability, it just makes sense to accommodate performance optimizations. For example, although our C++ style guide prohibits use of exceptions, it includes a rule that allows the use of noexcept, an exception-related language specifier that can trigger compiler optimizations.
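As an illustration of the optimization at stake (this example is ours, not the style guide's, and the Buffer type is hypothetical): a move constructor declared noexcept lets a container such as std::vector move elements when it grows instead of copying them, because the container can still keep its exception-safety guarantees.

    #include <iostream>
    #include <vector>

    class Buffer {
     public:
      Buffer() = default;
      Buffer(const Buffer&) { std::cout << "expensive copy\n"; }
      // Without noexcept here, std::vector would copy (not move) existing
      // elements on reallocation, to preserve its exception guarantees.
      Buffer(Buffer&&) noexcept { std::cout << "cheap move\n"; }
    };

    int main() {
      std::vector<Buffer> buffers;
      buffers.emplace_back();
      buffers.emplace_back();  // Typically forces a reallocation, which
                               // moves the existing element: "cheap move".
    }

The single noexcept keyword is the entire difference between the two behaviors, which is why the guide carves out an allowance for it.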
Interoperability also matters. Code that is designed to work with specific non-Google pieces might do better if tailored for its target. For example, our C++ style guide includes an exception to the general CamelCase naming guideline that permits use of the standard library's snake_case style for entities that mimic standard library features.8 The C++ style guide also allows exemptions for Windows programming, where compatibility with platform features requires multiple inheritance, something explicitly forbidden for all other C++ code. Both our Java and JavaScript style guides explicitly state that generated code, which frequently interfaces with or depends on components outside of a project's ownership, is out of scope for the guide's rules.9 Consistency is vital; adaptation is key.

The Style Guide
So, what does go into a language style guide? There are roughly three categories into which all style guide rules fall:

• Rules to avoid dangers
• Rules to enforce best practices
• Rules to ensure consistency

Avoiding danger
First and foremost, our style guides include rules about language features that either must or must not be used, for technical reasons. We have rules about how to use static members and variables; rules about using lambda expressions; rules about handling exceptions; rules about building for threading, access control, and class inheritance. We cover which language features to use and which constructs to avoid. We call out standard vocabulary types that may be used and for what purposes. We specifically include rulings on the hard-to-use and the hard-to-use-correctly—some language features have nuanced usage patterns that might not be intuitive or easy to apply properly, causing subtle bugs to creep in. For each ruling in the guide, we aim to include the pros and cons that were weighed with an explanation of the decision that was reached. Most of these decisions are based on the need for resilience to time, supporting and encouraging maintainable language usage.

Enforcing best practices
Our style guides also include rules enforcing some best practices of writing source code. These rules help keep the codebase healthy and maintainable. For example, we specify where and how code authors must include comments.10 Our rules for comments cover general conventions for commenting and extend to include specific cases that must include in-code documentation—cases in which intent is not always obvious, such as fall-through in switch statements, empty exception catch blocks, and template metaprogramming. We also have rules detailing the structuring of source files, outlining the organization of expected content. We have rules about naming: naming of packages, of classes, of functions, of variables. All of these rules are intended to guide engineers to practices that support healthier, more sustainable code.

8 See Exceptions to Naming Rules. As an example, our open sourced Abseil libraries use snake_case naming for types intended to be replacements for standard types. See the types defined in https://github.com/abseil/abseil-cpp/blob/master/absl/utility/utility.h. These are C++11 implementations of C++14 standard types and therefore use the standard's favored snake_case style instead of Google's preferred CamelCase form.
9 See Generated code: mostly exempt.
10 See https://google.github.io/styleguide/cppguide.html#Comments, http://google.github.io/styleguide/pyguide#38-comments-and-docstrings, and https://google.github.io/styleguide/javaguide.html#s7-javadoc, where multiple languages define general comment rules.
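To make the fall-through case concrete, here is a small sketch of ours (the logging functions are hypothetical). Without the comment and the explicit annotation, a reader cannot tell whether the fall-through is intended or a bug, which is exactly why our comment rules demand in-code documentation at spots like this; an empty catch block in a language that uses exceptions would get the same treatment.

    #include <iostream>
    #include <string>

    void PageOncall(const std::string& msg) { std::cout << "page: " << msg << "\n"; }
    void WriteToErrorLog(const std::string& msg) { std::cout << "error: " << msg << "\n"; }
    void WriteToInfoLog(const std::string& msg) { std::cout << "info: " << msg << "\n"; }

    void Log(int severity, const std::string& msg) {
      switch (severity) {
        case 2:
          PageOncall(msg);
          // Intentional fall-through: severe problems are also logged as errors.
          [[fallthrough]];
        case 1:
          WriteToErrorLog(msg);
          break;
        default:
          WriteToInfoLog(msg);
          break;
      }
    }

    int main() { Log(2, "disk full"); }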
Some of the best practices enforced by our style guides are designed to make source code more readable. Many formatting rules fall under this category. Our style guides specify when and how to use vertical and horizontal whitespace in order to improve readability. They also cover line length limits and brace alignment. For some languages, we cover formatting requirements by deferring to autoformatting tools—gofmt for Go, dartfmt for Dart. Whether we itemize a detailed list of formatting requirements or name a tool that must be applied, the goal is the same: we have a consistent set of formatting rules designed to improve readability that we apply to all of our code.

Our style guides also include limitations on new and not-yet-well-understood language features. The goal is to preemptively install safety fences around a feature's potential pitfalls while we all go through the learning process. At the same time, before everyone takes off running, limiting use gives us a chance to watch the usage patterns that develop and extract best practices from the examples we observe. For these new features, at the outset, we are sometimes not sure of the proper guidance to give. As adoption spreads, engineers wanting to use the new features in different ways discuss their examples with the style guide owners, asking for allowances to permit additional use cases beyond those covered by the initial restrictions. Watching the waiver requests that come in, we get a sense of how the feature is getting used and eventually collect enough examples to generalize good practice from bad. After we have that information, we can circle back to the restrictive ruling and amend it to allow wider use.

Case Study: Introducing std::unique_ptr
When C++11 introduced std::unique_ptr, a smart pointer type that expresses exclusive ownership of a dynamically allocated object and deletes the object when the unique_ptr goes out of scope, our style guide initially disallowed usage. The behavior of the unique_ptr was unfamiliar to most engineers, and the related move semantics that the language introduced were very new and, to most engineers, very confusing. Preventing the introduction of std::unique_ptr in the codebase seemed the safer choice. We updated our tooling to catch references to the disallowed type and kept our existing guidance recommending other types of existing smart pointers.

Time passed. Engineers had a chance to adjust to the implications of move semantics, and we became increasingly convinced that using std::unique_ptr was directly in line with the goals of our style guidance. The information regarding object ownership that a std::unique_ptr facilitates at a function call site makes it far easier for a reader to understand that code. The added complexity of introducing this new type, and the novel move semantics that come with it, was still a strong concern, but the significant improvement in the long-term overall state of the codebase made the adoption of std::unique_ptr a worthwhile trade-off.
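A minimal sketch of the call-site clarity the case study describes (the Report type and functions are our own invention, not from the guide):

    #include <memory>
    #include <string>
    #include <utility>

    class Report {
     public:
      explicit Report(std::string body) : body_(std::move(body)) {}

     private:
      std::string body_;
    };

    // The signature alone tells the reader that Archive takes ownership:
    // after the call, the caller no longer has the Report.
    void Archive(std::unique_ptr<Report> report) {
      // ... persist *report; it is destroyed automatically when the last
      // owner lets go.
    }

    int main() {
      auto report = std::make_unique<Report>("quarterly numbers");
      Archive(std::move(report));  // Ownership transfer is explicit here.
      // `report` is now null; any further use is a bug a reviewer can
      // spot at a glance.
    }

Compare a raw Report* parameter: the reader could not tell whether Archive takes ownership without consulting its documentation or implementation.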
Building in consistency
Our style guides also contain rules that cover a lot of the smaller stuff. For these rules, we make and document a decision primarily to make and document a decision. Many rules in this category don't have significant technical impact. Things like naming conventions, indentation spacing, import ordering: there is usually no clear, measurable, technical benefit for one form over another, which might be why the technical community tends to keep debating them.11 By choosing one, we've dropped out of the endless debate cycle and can just move on. Our engineers no longer spend time discussing two spaces versus four. The important bit for this category of rules is not what we've chosen for a given rule so much as the fact that we have chosen.

11 Such discussions are really just bikeshedding, an illustration of Parkinson's law of triviality.

And for everything else...
With all that, there's a lot that's not in our style guides. We try to focus on the things that have the greatest impact on the health of our codebase. There are absolutely best practices left unspecified by these documents, including many fundamental pieces of good engineering advice: don't be clever, don't fork the codebase, don't reinvent the wheel, and so on. Documents like our style guides can't serve to take a complete novice all the way to a master-level understanding of software engineering—there are some things we assume, and this is intentional.

Changing the Rules
Our style guides aren't static. As with most things, given the passage of time, the landscape within which a style guide decision was made and the factors that guided a given ruling are likely to change. Sometimes, conditions change enough to warrant reevaluation. If a new language version is released, we might want to update our rules to allow or exclude new features and idioms. If a rule is causing engineers to invest effort to circumvent it, we might need to reexamine the benefits the rule was supposed to provide. If the tools that we use to enforce a rule become overly complex and burdensome to maintain, the rule itself might have decayed and need to be revisited. Noticing when a rule is ready for another look is an important part of the process that keeps our rule set relevant and up to date.

The decisions behind rules captured in our style guides are backed by evidence. When adding a rule, we spend time discussing and analyzing the relevant pros and cons as well as the potential consequences, trying to verify that a given change is appropriate for the scale at which Google operates. Most entries in Google's style guides include these considerations, laying out the pros and cons that were weighed during the process and giving the reasoning for the final ruling. Ideally, we prioritize this detailed reasoning and include it with every rule.

Documenting the reasoning behind a given decision gives us the advantage of being able to recognize when things need to change. Given the passage of time and changing conditions, a good decision made previously might not be the best current one. With influencing factors clearly noted, we are able to identify when changes related to one or more of these factors warrant reevaluating the rule.

Case Study: CamelCase Naming
At Google, when we defined our initial style guidance for Python code, we chose to use CamelCase naming style instead of snake_case naming style for method names. Although the public Python style guide (PEP 8) and most of the Python community used snake_case naming, most of Google's Python usage at the time was for C++ developers using Python as a scripting layer on top of a C++ codebase. Many of the defined Python types were wrappers for corresponding C++ types, and because Google's C++ naming conventions follow CamelCase style, the cross-language consistency was seen as key.

Later, we reached a point at which we were building and supporting independent Python applications. The engineers most frequently using Python were Python engi‐
neers developing Python projects, not C++ engineers pulling together a quick script. We were causing a degree of awkwardness and readability problems for our Python engineers, requiring them to maintain one standard for our internal code but constantly adjust for another standard every time they referenced external code. We were also making it more difficult for new hires who came in with Python experience to adapt to our codebase norms.

As our Python projects grew, our code more frequently interacted with external Python projects. We were incorporating third-party Python libraries for some of our projects, leading to a mix within our codebase of our own CamelCase format with the externally preferred snake_case style. As we started to open source some of our Python projects, maintaining them in an external world where our conventions were nonconformist added both complexity on our part and wariness from a community that found our style surprising and somewhat weird.

Presented with these arguments, after discussing both the costs (losing consistency with other Google code, reeducation for Googlers used to our Python style) and benefits (gaining consistency with most other Python code, allowing what was already leaking in with third-party libraries), the style arbiters for the Python style guide decided to change the rule. With the restriction that it be applied as a file-wide choice, an exemption for existing code, and the latitude for projects to decide what is best for them, the Google Python style guide was updated to permit snake_case naming.

The Process
Recognizing that things will need to change, given the long lifetime and ability to scale that we are aiming for, we created a process for updating our rules. The process for changing our style guide is solution based. Proposals for style guide updates are framed with this view, identifying an existing problem and presenting the proposed change as a way to fix it. "Problems," in this process, are not hypothetical examples of things that could go wrong; problems are proven with patterns found in existing Google code. Given a demonstrated problem, because we have the detailed reasoning behind the existing style guide decision, we can reevaluate, checking whether a different conclusion now makes more sense.

The community of engineers writing code governed by the style guide is often best positioned to notice when a rule might need to be changed. Indeed, here at Google, most changes to our style guides begin with community discussion. Any engineer can ask questions or propose a change, usually by starting with the language-specific mailing lists dedicated to style guide discussions.

Proposals for style guide changes might come fully formed, with specific, updated wording suggested, or might start as vague questions about the applicability of a given rule. Incoming ideas are discussed by the community, receiving feedback from other language users. Some proposals are rejected by community consensus, gauged
to be unnecessary, too ambiguous, or not beneficial. Others receive positive feedback, gauged to have merit either as-is or with some suggested refinement. These proposals, the ones that make it through community review, are subject to final decision-making approval.

The Style Arbiters
At Google, for each language's style guide, final decisions and approvals are made by the style guide's owners—our style arbiters. For each programming language, a group of long-time language experts are the owners of the style guide and the designated decision makers. The style arbiters for a given language are often senior members of the language's library team and other long-time Googlers with relevant language experience.

The actual decision making for any style guide change is a discussion of the engineering trade-offs for the proposed modification. The arbiters make decisions within the context of the agreed-upon goals for which the style guide optimizes. Changes are not made according to personal preference; they're trade-off judgments. In fact, the C++ style arbiter group currently consists of four members. This might seem strange: an odd number of committee members would prevent tied votes in case of a split decision. However, because of the nature of the decision-making approach, where nothing is "because I think it should be this way" and everything is an evaluation of trade-offs, decisions are made by consensus rather than by voting. The four-member group is happily functional as-is.

Exceptions
Yes, our rules are law, but yes, some rules warrant exceptions. Our rules are typically designed for the greater, general case. Sometimes, specific situations would benefit from an exemption to a particular rule. When such a scenario arises, the style arbiters are consulted to determine whether there is a valid case for granting a waiver to a particular rule.

Waivers are not granted lightly. In C++ code, if a macro API is introduced, the style guide mandates that it be named using a project-specific prefix. Because of the way C++ handles macros, treating them as members of the global namespace, all macros that are exported from header files must have globally unique names to prevent collisions. The style guide rule regarding macro naming does allow for arbiter-granted exemptions for some utility macros that are genuinely global. However, when the reason behind a waiver request asking to exclude a project-specific prefix comes down to preferences due to macro name length or project consistency, the waiver is rejected. The integrity of the codebase outweighs the consistency of the project here.

Exceptions are allowed for cases in which it is gauged to be more beneficial to permit the rule-breaking than to avoid it. The C++ style guide disallows implicit type conversions, including single-argument constructors. However, for types that are designed to transparently wrap other types, where the underlying data is still accurately and precisely represented, it's perfectly reasonable to allow implicit conversion. In such cases, waivers to the no-implicit-conversion rule are granted.
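A sketch of the kind of type that can earn such a waiver (the Milliseconds type here is hypothetical): it transparently wraps an integer count, so the implicit conversion loses no information and reads naturally at call sites.

    #include <cstdint>
    #include <iostream>

    class Milliseconds {
     public:
      // Implicit by design, and only permissible via waiver: a Milliseconds
      // value *is* the wrapped count, accurately and precisely represented.
      Milliseconds(std::int64_t count) : count_(count) {}

      std::int64_t count() const { return count_; }

     private:
      std::int64_t count_;
    };

    void Sleep(Milliseconds timeout) {
      std::cout << "sleeping for " << timeout.count() << "ms\n";
    }

    int main() {
      Sleep(250);  // Reads naturally; nothing is lost in the conversion.
    }

For a type that is not such a transparent wrapper, the same constructor would be required to be explicit.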
Having such a clear case for valid exemptions might indicate that the rule in question needs to be clarified or amended. However, for this specific rule, enough waiver requests are received that appear to fit the valid case for exemption but in fact do not—either because the specific type in question is not actually a transparent wrapper type or because the type is a wrapper but is not actually needed—that keeping the rule in place as-is is still worthwhile.

Guidance
In addition to rules, we curate programming guidance in various forms, ranging from long, in-depth discussions of complex topics to short, pointed advice on best practices that we endorse.

Guidance represents the collected wisdom of our engineering experience, documenting the best practices that we've extracted from the lessons learned along the way. Guidance tends to focus on things that we've observed people frequently getting wrong or new things that are unfamiliar and therefore subject to confusion. If the rules are the "musts," our guidance is the "shoulds."

One example of a pool of guidance that we cultivate is a set of primers for some of the predominant languages that we use. While our style guides are prescriptive, ruling on which language features are allowed and which are disallowed, the primers are descriptive, explaining the features that the guides endorse. They are quite broad in their coverage, touching on nearly every topic that an engineer new to the language's use at Google would need to reference. They do not delve into every detail of a given topic, but they provide explanations and recommended use. When an engineer needs to figure out how to apply a feature that they want to use, the primers aim to serve as the go-to guiding reference.

A few years ago, we began publishing a series of C++ tips that offered a mix of general language advice and Google-specific tips. We cover hard things—object lifetime, copy and move semantics, argument-dependent lookup; new things—C++11 features as they were adopted in the codebase, preadopted C++17 types like string_view, optional, and variant; and things that needed a gentle nudge of correction—reminders not to use using directives, warnings to remember to look out for implicit bool conversions. The tips grow out of actual problems encountered, addressing real programming issues that are not covered by the style guides. Their advice, unlike the rules in the style guide, is not true canon; it is still in the category of advice rather than rule. However, given the way they grow from observed patterns rather than abstract ideals, their broad and direct applicability sets them apart
from most other advice as a sort of “canon of the common.” Tips are narrowly focused and relatively short, each one no more than a few minutes’ read. This “Tip of the Week” series has been extremely successful internally, with frequent citations dur‐ ing code reviews and technical discussions.12 Software engineers come in to a new project or codebase with knowledge of the pro‐ gramming language they are going to be using, but lacking the knowledge of how the programming language is used within Google. To bridge this gap, we maintain a ser‐ ies of “<Language>@Google 101” courses for each of the primary programming lan‐ guages in use. These full-day courses focus on what makes development with that language different in our codebase. They cover the most frequently used libraries and idioms, in-house preferences, and custom tool usage. For a C++ engineer who has just become a Google C++ engineer, the course fills in the missing pieces that make them not just a good engineer, but a good Google codebase engineer. In addition to teaching courses that aim to get someone completely unfamiliar with our setup up and running quickly, we also cultivate ready references for engineers deep in the codebase to find the information that could help them on the go. These references vary in form and span the languages that we use. Some of the useful refer‐ ences that we maintain internally include the following: • Language-specific advice for the areas that are generally more difficult to get cor‐ rect (such as concurrency and hashing). • Detailed breakdowns of new features that are introduced with a language update and advice on how to use them within the codebase. • Listings of key abstractions and data structures provided by our libraries. This keeps us from reinventing structures that already exist and provides a response to, “I need a thing, but I don’t know what it’s called in our libraries.” Applying the Rules Rules, by their nature, lend greater value when they are enforceable. Rules can be enforced socially, through teaching and training, or technically, with tooling. We have various formal training courses at Google that cover many of the best practices that our rules require. We also invest resources in keeping our documentation up to date to ensure that reference material remains accurate and current. A key part of our overall training approach when it comes to awareness and understanding of our rules is the role that code reviews play. The readability process that we run here at Google —where engineers new to Google’s development environment for a given language are mentored through code reviews—is, to a great extent, about cultivating the habits 12 https://abseil.io/tips has a selection of some of our most popular tips. 158 | Chapter 8: Style Guides and Rules
and patterns required by our style guides (see details on the readability process in Chapter 3). The process is an important piece of how we ensure that these practices are learned and applied across project boundaries.

Although some level of training is always necessary—engineers must, after all, learn the rules so that they can write code that follows them—when it comes to checking for compliance, rather than exclusively depending on engineer-based verification, we strongly prefer to automate enforcement with tooling.

Automated rule enforcement ensures that rules are not dropped or forgotten as time passes or as an organization scales up. New people join; they might not yet know all the rules. Rules change over time; even with good communication, not everyone will remember the current state of everything. Projects grow and add new features; rules that had previously not been relevant are suddenly applicable. An engineer checking for rule compliance depends on either memory or documentation, both of which can fail. As long as our tooling stays up to date, in sync with our rule changes, we know that our rules are being applied by all our engineers for all our projects.

Another advantage of automated enforcement is minimization of the variance in how a rule is interpreted and applied. When we write a script or use a tool to check for compliance, we validate all inputs against a single, unchanging definition of the rule. We aren't leaving interpretation up to each individual engineer. Human engineers view everything with a perspective colored by their biases. Unconscious or not, potentially subtle, and even possibly harmless, biases still change the way people view things. Leaving enforcement up to engineers is likely to produce inconsistent interpretation and application of the rules, potentially with inconsistent expectations of accountability. The more that we delegate to the tools, the fewer entry points we leave for human biases to enter.

Tooling also makes enforcement scalable. As an organization grows, a single team of experts can write tools that the rest of the company can use. If the company doubles in size, the effort to enforce all rules across the entire organization doesn't double; it costs about the same as it did before.

Even with the advantages we get by incorporating tooling, it might not be possible to automate enforcement for all rules. Some technical rules explicitly call for human judgment. In the C++ style guide, for example: "Avoid complicated template metaprogramming." "Use auto to avoid type names that are noisy, obvious, or unimportant—cases where the type doesn't aid in clarity for the reader." "Composition is often more appropriate than inheritance." In the Java style guide: "There's no single correct recipe for how to [order the members and initializers of your class]; different classes may order their contents in different ways." "It is very rarely correct to do nothing in response to a caught exception." "It is extremely rare to override Object.finalize." For all of these rules, judgment is required and tooling can't (yet!) take that place.
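The auto rule is a good illustration of why such judgment resists automation. In this small sketch of ours, both uses of a declared type are mechanically identical, yet only a human can weigh whether spelling the type out "aids clarity" for a reader:

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
      std::map<std::string, int> word_counts = {{"hello", 1}};

      // auto earns its keep here: the iterator's spelled-out type would add
      // noise without telling the reader anything the next line doesn't.
      auto it = word_counts.find("hello");

      // Here, spelling out the type carries real information; whether that
      // information "aids clarity" is exactly the judgment a tool can't make.
      int count = (it != word_counts.end()) ? it->second : 0;
      std::cout << count << "\n";
    }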
Other rules are social rather than technical, and it is often unwise to solve social problems with a technical solution. For many of the rules that fall under this category, the details tend to be a bit less well defined and tooling would become complex and expensive. It’s often better to leave enforcement of those rules to humans. For exam‐ ple, when it comes to the size of a given code change (i.e., the number of files affected and lines modified) we recommend that engineers favor smaller changes. Small changes are easier for engineers to review, so reviews tend to be faster and more thor‐ ough. They’re also less likely to introduce bugs because it’s easier to reason about the potential impact and effects of a smaller change. The definition of small, however, is somewhat nebulous. A change that propagates the identical one-line update across hundreds of files might actually be easy to review. By contrast, a smaller, 20-line change might introduce complex logic with side effects that are difficult to evaluate. We recognize that there are many different measurements of size, some of which may be subjective—particularly when taking the complexity of a change into account. This is why we do not have any tooling to autoreject a proposed change that exceeds an arbitrary line limit. Reviewers can (and do) push back if they judge a change to be too large. For this and similar rules, enforcement is up to the discretion of the engineers authoring and reviewing the code. When it comes to technical rules, however, when‐ ever it is feasible, we favor technical enforcement. Error Checkers Many rules covering language usage can be enforced with static analysis tools. In fact, an informal survey of the C++ style guide by some of our C++ librarians in mid-2018 estimated that roughly 90% of its rules could be automatically verified. Error- checking tools take a set of rules or patterns and verify that a given code sample fully complies. Automated verification removes the burden of remembering all applicable rules from the code author. If an engineer only needs to look for violation warnings— many of which come with suggested fixes—surfaced during code review by an ana‐ lyzer that has been tightly integrated into the development workflow, we minimize the effort that it takes to comply with the rules. When we began using tools to flag deprecated functions based on source tagging, surfacing both the warning and the suggested fix in-place, the problem of having new usages of deprecated APIs disap‐ peared almost overnight. Keeping the cost of compliance down makes it more likely for engineers to happily follow through. We use tools like clang-tidy (for C++) and Error Prone (for Java) to automate the process of enforcing rules. See Chapter 20 for an in-depth discussion of our approach. The tools we use are designed and tailored to support the rules that we define. Most tools in support of rules are absolutes; everybody must comply with the rules, so everybody uses the tools that check them. Sometimes, when tools support best practi‐ 160 | Chapter 8: Style Guides and Rules
ces where there’s a bit more flexibility in conforming to the conventions, there are opt-out mechanisms to allow projects to adjust for their needs. Code Formatters At Google, we generally use automated style checkers and formatters to enforce con‐ sistent formatting within our code. The question of line lengths has stopped being interesting.13 Engineers just run the style checkers and keep moving forward. When formatting is done the same way every time, it becomes a non-issue during code review, eliminating the review cycles that are otherwise spent finding, flagging, and fixing minor style nits. In managing the largest codebase ever, we’ve had the opportunity to observe the results of formatting done by humans versus formatting done by automated tooling. The robots are better on average than the humans by a significant amount. There are some places where domain expertise matters—formatting a matrix, for example, is something a human can usually do better than a general-purpose formatter. Failing that, formatting code with an automated style checker rarely goes wrong. We enforce use of these formatters with presubmit checks: before code can be sub‐ mitted, a service checks whether running the formatter on the code produces any diffs. If it does, the submit is rejected with instructions on how to run the formatter to fix the code. Most code at Google is subject to such a presubmit check. For our code, we use clang-format for C++; an in-house wrapper around yapf for Python; gofmt for Go; dartfmt for Dart; and buildifier for our BUILD files. Case Study: gofmt Sameer Ajmani Google released the Go programming language as open source on November 10, 2009. Since then, Go has grown as a language for developing services, tools, cloud infrastructure, and open source software.14 We knew that we needed a standard format for Go code from day one. We also knew that it would be nearly impossible to retrofit a standard format after the open source release. So the initial Go release included gofmt, the standard formatting tool for Go. 13 When you consider that it takes at least two engineers to have the discussion and multiply that by the number of times this conversation is likely to happen within a collection of more than 30,000 engineers, it turns out that “how many characters” can become a very expensive question. 14 In December 2018, Go was the #4 language on GitHub as measured by pull requests. Applying the Rules | 161
Motivations Code reviews are a software engineering best practice, yet too much time was spent in review arguing over formatting. Although a standard format wouldn’t be everyone’s favorite, it would be good enough to eliminate this wasted time.15 By standardizing the format, we laid the foundation for tools that could automatically update Go code without creating spurious diffs: machine-edited code would be indis‐ tinguishable from human-edited code.16 For example, in the months leading up to Go 1.0 in 2012, the Go team used a tool called gofix to automatically update pre-1.0 Go code to the stable version of the lan‐ guage and libraries. Thanks to gofmt, the diffs gofix produced included only the important bits: changes to uses of the language and APIs. This allowed programmers to more easily review the changes and learn from the changes the tool made. Impact Go programmers expect that all Go code is formatted with gofmt. gofmt has no con‐ figuration knobs, and its behavior rarely changes. All major editors and IDEs use gofmt or emulate its behavior, so nearly all Go code in existence is formatted identi‐ cally. At first, Go users complained about the enforced standard; now, users often cite gofmt as one of the many reasons they like Go. Even when reading unfamiliar Go code, the format is familiar. Thousands of open source packages read and write Go code.17 Because all editors and IDEs agree on the Go format, Go tools are portable and easily integrated into new developer environments and workflows via the command line. Retrofitting In 2012, we decided to automatically format all BUILD files at Google using a new standard formatter: buildifier. BUILD files contain the rules for building Google’s software with Blaze, Google’s build system. A standard BUILD format would enable us to create tools that automatically edit BUILD files without disrupting their format, just as Go tools do with Go files. It took six weeks for one engineer to get the reformatting of Google’s 200,000 BUILD files accepted by the various code owners, during which more than a thousand new 15 Robert Griesemer’s 2015 talk, “The Cultural Evolution of gofmt,” provides details on the motivation, design, and impact of gofmt on Go and other languages. 16 Russ Cox explained in 2009 that gofmt was about automated changes: “So we have all the hard parts of a pro‐ gram manipulation tool just sitting waiting to be used. Agreeing to accept ‘gofmt style’ is the piece that makes it doable in a finite amount of code.” 17 The Go AST and format packages each have thousands of importers. 162 | Chapter 8: Style Guides and Rules
BUILD files were added each week. Google’s nascent infrastructure for making large- scale changes greatly accelerated this effort. (See Chapter 22.) Conclusion For any organization, but especially for an organization as large as Google’s engineer‐ ing force, rules help us to manage complexity and build a maintainable codebase. A shared set of rules frames the engineering processes so that they can scale up and keep growing, keeping both the codebase and the organization sustainable for the long term. TL;DRs • Rules and guidance should aim to support resilience to time and scaling. • Know the data so that rules can be adjusted. • Not everything should be a rule. • Consistency is key. • Automate enforcement when possible. Conclusion | 163
CHAPTER 9 Code Review Written by Tom Manshreck and Caitlin Sadowski Edited by Lisa Carey Code review is a process in which code is reviewed by someone other than the author, often before the introduction of that code into a codebase. Although that is a simple definition, implementations of the process of code review vary widely throughout the software industry. Some organizations have a select group of “gatekeepers” across the codebase that review changes. Others delegate code review processes to smaller teams, allowing different teams to require different levels of code review. At Google, essentially every change is reviewed before being committed, and every engineer is responsible for initiating reviews and reviewing changes. Code reviews generally require a combination of a process and a tool supporting that process. At Google, we use a custom code review tool, Critique, to support our pro‐ cess.1 Critique is an important enough tool at Google to warrant its own chapter in this book. This chapter focuses on the process of code review as it is practiced at Goo‐ gle rather than the specific tool, both because these foundations are older than the tool and because most of these insights can be adapted to whatever tool you might use for code review. For more information on Critique, see Chapter 19. 1 We also use Gerrit to review Git code, primarily for our open source projects. However, Critique is the pri‐ mary tool of a typical software engineer at Google. 165
Some of the benefits of code review, such as detecting bugs in code before they enter a codebase, are well established2 and somewhat obvious (if imprecisely measured). Other benefits, however, are more subtle. Because the code review process at Google is so ubiquitous and extensive, we’ve noticed many of these more subtle effects, including psychological ones, which provide many benefits to an organization over time and scale. Code Review Flow Code reviews can happen at many stages of software development. At Google, code reviews take place before a change can be committed to the codebase; this stage is also known as a precommit review. The primary end goal of a code review is to get another engineer to consent to the change, which we denote by tagging the change as “looks good to me” (LGTM). We use this LGTM as a necessary permissions “bit” (combined with other bits noted below) to allow the change to be committed. A typical code review at Google goes through the following steps: 1. A user writes a change to the codebase in their workspace. This author then cre‐ ates a snapshot of the change: a patch and corresponding description that are uploaded to the code review tool. This change produces a diff against the code‐ base, which is used to evaluate what code has changed. 2. The author can use this initial patch to apply automated review comments or do self-review. When the author is satisfied with the diff of the change, they mail the change to one or more reviewers. This process notifies those reviewers, asking them to view and comment on the snapshot. 3. Reviewers open the change in the code review tool and post comments on the diff. Some comments request explicit resolution. Some are merely informational. 4. The author modifies the change and uploads new snapshots based on the feed‐ back and then replies back to the reviewers. Steps 3 and 4 may be repeated multi‐ ple times. 5. After the reviewers are happy with the latest state of the change, they agree to the change and accept it by marking it as “looks good to me” (LGTM). Only one LGTM is required by default, although convention might request that all review‐ ers agree to the change. 6. After a change is marked LGTM, the author is allowed to commit the change to the codebase, provided they resolve all comments and that the change is approved. We’ll cover approval in the next section. 2 Steve McConnell, Code Complete (Redmond: Microsoft Press, 2004). 166 | Chapter 9: Code Review
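The permissions "bit" language above can be made concrete with a toy model. This is our illustration, not how Critique is actually implemented, and it folds the approvals discussed in the next section into a single bit:

    // Toy model of the submission gate: every "bit" must be set before the
    // change can be committed. Google's real tooling is far richer.
    struct ChangeState {
      bool lgtm = false;               // A reviewer marked "looks good to me."
      bool approved = false;           // Approval, covered in the next section.
      bool comments_resolved = false;  // All blocking comments are addressed.
    };

    bool CanSubmit(const ChangeState& change) {
      return change.lgtm && change.approved && change.comments_resolved;
    }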
We’ll go over this process in more detail later in this chapter. Code Is a Liability It’s important to remember (and accept) that code itself is a liability. It might be a nec‐ essary liability, but by itself, code is simply a maintenance task to someone some‐ where down the line. Much like the fuel that an airplane carries, it has weight, though it is, of course, necessary for that airplane to fly. New features are often necessary, of course, but care should be taken before develop‐ ing code in the first place to ensure that any new feature is warranted. Duplicated code not only is a wasted effort, it can actually cost more in time than not having the code at all; changes that could be easily performed under one code pattern often require more effort when there is duplication in the codebase. Writing entirely new code is so frowned upon that some of us have a saying: “If you’re writing it from scratch, you’re doing it wrong!” This is especially true of library or utility code. Chances are, if you are writing a util‐ ity, someone else somewhere in a codebase the size of Google’s has probably done something similar. Tools such as those discussed in Chapter 17 are therefore critical for both finding such utility code and preventing the introduction of duplicate code. Ideally, this research is done beforehand, and a design for anything new has been communicated to the proper groups before any new code is written. Of course, new projects happen, new techniques are introduced, new components are needed, and so on. All that said, a code review is not an occasion to rehash or debate previous design decisions. Design decisions often take time, requiring the circulation of design proposals, debate on the design in API reviews or similar meetings, and per‐ haps the development of prototypes. As much as a code review of entirely new code should not come out of the blue, the code review process itself should also not be viewed as an opportunity to revisit previous decisions. How Code Review Works at Google We’ve pointed out roughly how the typical code review process works, but the devil is in the details. This section outlines in detail how code review works at Google and how these practices allow it to scale properly over time. There are three aspects of review that require “approval” for any given change at Google: • A correctness and comprehension check from another engineer that the code is appropriate and does what the author claims it does. This is often a team mem‐ ber, though it does not need to be. This is reflected in the LGTM permissions How Code Review Works at Google | 167
“bit,” which will be set after a peer reviewer agrees that the code “looks good” to them. • Approval from one of the code owners that the code is appropriate for this par‐ ticular part of the codebase (and can be checked into a particular directory). This approval might be implicit if the author is such an owner. Google’s codebase is a tree structure with hierarchical owners of particular directories. (See Chapter 16). Owners act as gatekeepers for their particular directories. A change might be proposed by any engineer and LGTM’ed by any other engineer, but an owner of the directory in question must also approve this addition to their part of the codebase. Such an owner might be a tech lead or other engineer deemed expert in that particular area of the codebase. It’s generally up to each team to decide how broadly or narrowly to assign ownership privileges. • Approval from someone with language “readability”3 that the code conforms to the language’s style and best practices, checking whether the code is written in the manner we expect. This approval, again, might be implicit if the author has such readability. These engineers are pulled from a company-wide pool of engineers who have been granted readability in that programming language. Although this level of control sounds onerous—and, admittedly, it sometimes is— most reviews have one person assuming all three roles, which speeds up the process quite a bit. Importantly, the author can also assume the latter two roles, needing only an LGTM from another engineer to check code into their own codebase, provided they already have readability in that language (which owners often do). These requirements allow the code review process to be quite flexible. A tech lead who is an owner of a project and has that code’s language readability can submit a code change with only an LGTM from another engineer. An intern without such authority can submit the same change to the same codebase, provided they get appro‐ val from an owner with language readability. The three aforementioned permission “bits” can be combined in any combination. An author can even request more than one LGTM from separate people by explicitly tagging the change as wanting an LGTM from all reviewers. In practice, most code reviews that require more than one approval usually go through a two-step process: gaining an LGTM from a peer engineer, and then seeking approval from appropriate code owner/readability reviewer(s). This allows the two roles to focus on different aspects of the code review and saves review time. The pri‐ mary reviewer can focus on code correctness and the general validity of the code change; the code owner can focus on whether this change is appropriate for their part 3 At Google, “readability” does not refer simply to comprehension, but to the set of styles and best practices that allow code to be maintainable to other engineers. See Chapter 3. 168 | Chapter 9: Code Review
of the codebase without having to focus on the details of each line of code. An approver is often looking for something different than a peer reviewer, in other words. After all, someone is trying to check in code to their project/directory. They are more concerned with questions such as: "Will this code be easy or difficult to maintain?" "Does it add to my technical debt?" "Do we have the expertise to maintain it within our team?"

If all three of these types of reviews can be handled by one reviewer, why not just have those types of reviewers handle all code reviews? The short answer is scale. Separating the three roles adds flexibility to the code review process. If you are working with a peer on a new function within a utility library, you can get someone on your team to review the code for code correctness and comprehension. After several rounds (perhaps over several days), your code satisfies your peer reviewer and you get an LGTM. Now, you need only get an owner of the library (and owners often have appropriate readability) to approve the change.

Ownership
Hyrum Wright

When working on a small team in a dedicated repository, it's common to grant the entire team access to everything in the repository. After all, you know the other engineers, the domain is narrow enough that each of you can be experts, and small numbers constrain the effect of potential errors.

As the team grows larger, this approach can fail to scale. The result is either a messy repository split or a different approach to recording who has what knowledge and responsibilities in different parts of the repository. At Google, we call this set of knowledge and responsibilities ownership and the people who exercise them owners. This concept is less about possession of a collection of source code than about a sense of stewardship to act in the company's best interest with a section of the codebase. (Indeed, "stewards" would almost certainly be a better term if we had it to do over again.)

Specially named OWNERS files list usernames of people who have ownership responsibilities for a directory and its children. These files may also contain references to other OWNERS files or external access control lists, but eventually they resolve to a list of individuals. Each subdirectory may also contain a separate OWNERS file, and the relationship is hierarchically additive: a given file is generally owned by the union of the members of all the OWNERS files above it in the directory tree. OWNERS files may have as many entries as teams like, but we encourage a relatively small and focused list to ensure responsibility is clear.

Ownership of Google's code conveys approval rights for code within one's purview, but these rights also come with a set of responsibilities, such as understanding the code that is owned or knowing how to find somebody who does. Different teams have different criteria for granting ownership to new members, but we generally encourage them not to use ownership as a rite of initiation and encourage departing members to yield ownership as soon as is practical.

This distributed ownership structure enables many of the other practices we've outlined in this book. For example, the set of people in the root OWNERS file can act as global approvers for large-scale changes (see Chapter 22) without having to bother local teams. Likewise, OWNERS files act as a kind of documentation, making it easy for people and tools to find those responsible for a given piece of code just by walking up the directory tree. When new projects are created, there's no central authority that has to register new ownership privileges: a new OWNERS file is sufficient.

This ownership mechanism is simple, yet powerful, and has scaled well over the past two decades. It is one of the ways that Google ensures that tens of thousands of engineers can operate efficiently on billions of lines of code in a single repository.
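The "walking up the directory tree" lookup is simple enough to sketch. This toy version of ours ignores references to other OWNERS files and to access control lists, and the depot layout is hypothetical:

    #include <filesystem>
    #include <fstream>
    #include <iostream>
    #include <set>
    #include <string>

    namespace fs = std::filesystem;

    // Toy lookup: a file's owners are the union of the entries in every
    // OWNERS file from its own directory up to the repository root.
    std::set<std::string> OwnersOf(const fs::path& file, const fs::path& root) {
      std::set<std::string> owners;
      for (fs::path dir = file.parent_path(); ; dir = dir.parent_path()) {
        std::ifstream owners_file(dir / "OWNERS");
        for (std::string user; std::getline(owners_file, user);) {
          if (!user.empty()) owners.insert(user);
        }
        if (dir == root || !dir.has_parent_path()) break;
      }
      return owners;
    }

    int main() {
      for (const auto& user : OwnersOf("depot/search/index/file.cc", "depot")) {
        std::cout << user << "\n";
      }
    }

Real OWNERS resolution handles the indirection and access lists noted above; the point is only that the hierarchy makes ownership computable by tools as well as by people.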
Code Review Benefits
Across the industry, code review itself is not controversial, although it is far from a universal practice. Many (maybe even most) other companies and open source projects have some form of code review, and most view the process as important as a sanity check on the introduction of new code into a codebase. Software engineers understand some of the more obvious benefits of code review, even if they might not personally think it applies in all cases. But at Google, this process is generally more thorough and widespread than at most other companies.

Google's culture, like that of a lot of software companies, is based on giving engineers wide latitude in how they do their jobs. There is a recognition that strict processes tend not to work well for a dynamic company needing to respond quickly to new technologies, and that bureaucratic rules tend not to work well with creative professionals. Code review, however, is a mandate, one of the few blanket processes in which all software engineers at Google must participate. Google requires code review for almost4 every code change to the codebase, no matter how small. This mandate does have a cost and effect on engineering velocity given that it does slow down the introduction of new code into a codebase and can impact time-to-production for any given code change. (Both of these are common complaints by software engineers of strict code review processes.) Why, then, do we require this process? Why do we believe that this is a long-term benefit?

4 Some changes to documentation and configurations might not require a code review, but it is often still preferable to obtain such a review.
Code Review Benefits

Across the industry, code review itself is not controversial, although it is far from a universal practice. Many (maybe even most) other companies and open source projects have some form of code review, and most view the process as an important sanity check on the introduction of new code into a codebase. Software engineers understand some of the more obvious benefits of code review, even if they might not personally think it applies in all cases. But at Google, this process is generally more thorough and widespread than at most other companies.

Google’s culture, like that of a lot of software companies, is based on giving engineers wide latitude in how they do their jobs. There is a recognition that strict processes tend not to work well for a dynamic company needing to respond quickly to new technologies, and that bureaucratic rules tend not to work well with creative professionals. Code review, however, is a mandate, one of the few blanket processes in which all software engineers at Google must participate. Google requires code review for almost4 every code change to the codebase, no matter how small. This mandate does have a cost and an effect on engineering velocity: it slows the introduction of new code into the codebase and can impact time-to-production for any given code change. (Both are common complaints from software engineers about strict code review processes.) Why, then, do we require this process? Why do we believe that this is a long-term benefit?

4 Some changes to documentation and configurations might not require a code review, but it is often still preferable to obtain such a review.

A well-designed code review process and a culture of taking code review seriously provide the following benefits:

• Checks code correctness
• Ensures the code change is comprehensible to other engineers
• Enforces consistency across the codebase
• Psychologically promotes team ownership
• Enables knowledge sharing
• Provides a historical record of the code review itself

Many of these benefits are critical to a software organization over time, and many of them are beneficial not only to the author but also to the reviewers. The following sections go into more specifics for each of these items.

Code Correctness

An obvious benefit of code review is that it allows a reviewer to check the “correctness” of the code change. Having another set of eyes look over a change helps ensure that the change does what was intended. Reviewers typically look for whether a change has proper testing, is properly designed, and functions correctly and efficiently. In many cases, checking code correctness is checking whether the particular change can introduce bugs into the codebase.

Many reports point to the efficacy of code review in the prevention of future bugs in software. A study at IBM found that discovering defects earlier in a process, unsurprisingly, led to less time required to fix them later on.5 The investment of time in code review saved time otherwise spent in testing, debugging, and performing regressions, provided that the code review process itself was streamlined to keep it lightweight. This latter point is important: code review processes that are heavyweight, or that don’t scale properly, become unsustainable.6 We will get into some best practices for keeping the process lightweight later in this chapter.

5 “Advances in Software Inspection,” IEEE Transactions on Software Engineering SE-12(7): 744–751, July 1986. Granted, this study took place before robust tooling and automated testing had become so important in the software development process, but the results still seem relevant in the modern software age.

6 Rigby, Peter C., and Christian Bird. 2013. “Convergent software peer review practices.” ESEC/FSE 2013: Proceedings of the 2013 9th Joint Meeting on Foundations of Software Engineering, August 2013: 202–212. https://dl.acm.org/doi/10.1145/2491411.2491444.

To prevent the evaluation of correctness from becoming more subjective than objective, authors are generally given deference to their particular approach, whether it be in the design or the function of the introduced change. A reviewer shouldn’t propose
alternatives because of personal opinion. Reviewers can propose alternatives, but only if they improve comprehension (by being less complex, for example) or functionality (by being more efficient, for example). In general, engineers are encouraged to approve changes that improve the codebase rather than wait for consensus on a more “perfect” solution. This focus tends to speed up code reviews.

As tooling becomes stronger, many correctness checks are performed automatically through techniques such as static analysis and automated testing (though tooling might never completely obviate the need for human-based inspection of code—see Chapter 20 for more information). Though this tooling has its limits, it has definitely lessened the need to rely on human-based code reviews for checking code correctness.

That said, checking for defects during the initial code review process is still an integral part of a general “shift left” strategy, aiming to discover and resolve issues at the earliest possible time so that they don’t require escalated costs and resources further down in the development cycle. A code review is neither a panacea nor the only check for such correctness, but it is an element of a defense-in-depth against such problems in software. As a result, code review does not need to be “perfect” to achieve results.

Surprisingly enough, checking for code correctness is not the primary benefit Google accrues from the process of code review. Checking for code correctness generally ensures that a change works, but more importance is attached to ensuring that a code change is understandable and makes sense over time and as the codebase itself scales. To evaluate those aspects, we need to look at factors other than whether the code is simply logically “correct” or understood.

Comprehension of Code

A code review typically is the first opportunity for someone other than the author to inspect a change. This perspective allows a reviewer to do something that even the best engineer cannot do: provide feedback unbiased by an author’s perspective. A code review is often the first test of whether a given change is understandable to a broader audience. This perspective is vitally important because code will be read many more times than it is written, and understanding and comprehension are critically important.

It is often useful to find a reviewer who has a different perspective from the author, especially a reviewer who might need, as part of their job, to maintain or use the code being proposed within the change. Unlike the deference reviewers should give authors regarding design decisions, it’s often useful to treat questions on code comprehension using the maxim “the customer is always right.” In some respects, any questions you get now will be multiplied many-fold over time, so view each question on code comprehension as valid. This doesn’t mean that you need to change your approach or your logic in response to such a question, but it does mean that you might need to explain it more clearly.