Don't Change Hearts, Change Workflows
Hearts are fickle, CI failures are forever.
Say an engineering team wants to change the behavior of other engineering teams, and for a very good, big-picture reason. This happens often on horizontal platform teams or at wider architectural scope. I’ve been in a lot of conversations around goals of this type: encouraging better test coverage, considerations of web performance/accessibility, adopting a different coding paradigm (e.g. TypeScript), moving to a new framework, etc.
The good-faith approaches to these problems frequently center around education or inspiration:
- “Let’s send an email to the Engineering mailing list saying this is important.”
- “We’ll give a talk at Engineering All Hands.”
- “Let’s run monthly training.”
- “If we communicate the pitfalls that these behaviors create, others will see why we’re pushing this change and how it benefits everyone.”
Here’s a hard-learned truth: this doesn’t work. That isn’t to say that education / inspiration / teaching isn’t a core component of lasting change (it is), but changing sentiment alone doesn’t guarantee outcomes.
These alternatives suffer from a few critical flaws:
- They’re time-bound. You have to read the email. You have to see the Slack message. You have to attend the All Hands. What if you were on PTO? What if you get 100 emails a day and don’t read them all? What if you joined the company the day after All Hands? If a critical message is contingent upon the circumstances of “you had to be there,” it is brittle.
- Continued manual intervention doesn’t scale. For example: monthly training. If your team size is limited, are you really going to dedicate the time towards training for and proctoring a recurring meeting? Assuming everyone’s calendars fill with meetings across an organization, what stops your recurring training from becoming calendar noise? If nobody attends your recurring meeting for months, is your team going to stick with it? In my experience, the answer is overwhelmingly no. [1]
While these publicity campaigns are accompanied by short-term optimism, I often see well-meaning teams burn out in frustration in the long term with a bad-faith conclusion: “nobody cares.” And I don’t think that’s the real story. I have yet to find a good engineer who doesn’t care about well-written, performant, accessible code. In a vacuum they do care, but in practice they probably don’t care as much as you do, because their incentives don’t align with yours. In a universe of fixed time and attention, they’re going to pursue their own localized incentives over horizontal ones, because those localized incentives are what get them promoted. For product teams, that usually means getting a product out to customers as soon as possible, and manual interventions and trainings slow their progress toward that goal. This doesn’t make these engineers bad; it makes them rational.
Consequently I’ve developed a mantra in response to these circumstances:
“If you want to change engineer behavior, don’t change hearts, change workflows.”
In technical terms: codify your desired behaviors into system behavior as much as possible. What are the high-leverage points through which work must pass? Two that come to mind are a Design System and CI. A Design System is a great place to implement product behaviors: performance, accessibility, localization, etc. CI is a great place to enforce code-level behaviors: formatting, test coverage, bespoke checks to ensure characteristics on incoming commits, etc. Using both is aligned with an engineer’s incentives - you can’t get code to production otherwise! And instead of requiring engineers to keep all of these higher-level concerns in mind at the same time, they get them for “free.” All engineers have to do is use the tools available to them.
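As a rough illustration of a code-level CI check, here is a minimal sketch of a coverage gate in Python. The function name `coverage_gate` and its parameters are assumptions made for illustration, not part of any specific CI product. A real pipeline would read the covered and total line counts from its coverage tool's report.

```python
def coverage_gate(covered_lines: int, total_lines: int, minimum_percent: float) -> int:
    """Return a process exit code: 0 if coverage meets the minimum, 1 otherwise."""
    if total_lines <= 0:
        # Nothing measurable; pass rather than divide by zero.
        print("coverage gate: no measurable lines, skipping")
        return 0

    percent = 100.0 * covered_lines / total_lines
    if percent < minimum_percent:
        print(f"coverage gate: {percent:.1f}% is below the required {minimum_percent:.1f}%")
        return 1

    print(f"coverage gate: {percent:.1f}% meets the required {minimum_percent:.1f}%")
    return 0
```

Under these assumptions, a CI step would call `coverage_gate` with the measured counts and a threshold of the team's choosing, then exit with the returned code, so a drop below the threshold blocks the merge without anyone needing to remember the rule.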
Another reason I like this is that it scales to new hires very well. Sometimes incumbent coworkers will push back on workflow changes (see: new CI checks), especially if they’re perceived to slow down existing processes. And there’s a delicate balance here to keep in mind. But if you’re absolutely confident that the short-term trade-off in workflow speed is worth the long-term benefit of technical correctness, remember that current pushback discounts a reality: when new engineers join the company, the new workflow is simply normal. New hires assume the norms present on day one are the norms, and that’s a real mechanism by which technical and cultural change propagates over time.
Caveats
- CI, which is a singular bottleneck to getting code to production, is an extremely high-leverage point. The pace of CI is the pace of engineering: whether CI is fast and reliable or slow and flaky, engineering pace moves in lockstep. This means that if you’re going to add steps to a CI process, you must ensure those steps are high value and reliable. Adding faulty or flaky checks will quickly undermine the purpose of those checks and the velocity of engineering. Approach workflow changes with serious scrutiny.
- If you enforce behavior in CI, you must make it easy for an engineer who runs into an error state to resolve their own error quickly. Error states should be actionable and come with self-service documentation.
- You still need to invest in education / publicity. What is the strategic or cultural goal that motivates these workflows? Make sure that’s well communicated and easily accessible, and ideally report on the progress of the long-term goal on a regular cadence.
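To make the "actionable error" caveat concrete, here is one possible shape for a CI failure message, sketched in Python. The `CheckFailure` fields, the `render_failure` helper, and the documentation URL are all hypothetical; the point is only that each failure states the problem, the next step, and where to read more.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CheckFailure:
    check_name: str  # which CI check failed
    problem: str     # what went wrong, in plain language
    fix: str         # the concrete next step the engineer can take
    docs_url: str    # self-service documentation for this check


def render_failure(failure: CheckFailure) -> str:
    """Format a failure so the reader sees what broke, how to fix it, and where to read more."""
    return "\n".join([
        f"FAILED: {failure.check_name}",
        f"  Problem: {failure.problem}",
        f"  Fix:     {failure.fix}",
        f"  Docs:    {failure.docs_url}",
    ])


# Hypothetical example; the URL is a placeholder, not a real address.
example = CheckFailure(
    check_name="formatting",
    problem="2 files are not formatted",
    fix="run the project's formatter locally and commit the result",
    docs_url="https://docs.example.internal/ci/formatting",
)
print(render_failure(example))
```

Requiring every check to populate a `fix` and a `docs_url` is a design choice: it forces whoever adds the check to think through the self-service path before the check ever fails for someone else.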
What if you can’t codify?
One common piece of pushback I hear to this advice is: “well, that might work for some teams, but our problems require manual intervention.” And I agree that this is sometimes true - for example, determining the accessibility of a UI can’t be fully automated. But I also think teams can often save significant time by codifying low hanging fruit into tooling rather than relying solely on manual intervention, even if the tooling isn’t exhaustive.
For a real-world example, as part of the growth of the International engineering effort we had to develop a new architectural pattern for bootstrapping our language tooling into frontend applications. While this pattern ultimately led to successful product outcomes, it was extremely hard to debug when things went wrong, and required manual intervention from an engineer on our team to dive deep into the resulting stack trace. This was fine for a while, but these failures began to show up more frequently as teams began to build new greenfield applications. The manual triage required from our team became frustrating and time consuming.
One day I had a conversation with one of my teammates:
Me: “When you debug these things, are you mentally going through a series of checks?”
Them: “Yeah. I have a series of things I look for, roughly in order of prevalence.”
Me: “Can code do this check for us?”
Them: “Hmm… I think so. Some of them, yeah. Not all of the things, but maybe the top 2-3 failure modes.”
Me: “What if we created a script we could ask engineers to run on their machine to check through those failure modes first before they ask us for help?”
Them: “I’ll try it out.”
That script - i18n-diagnose - linked to internal documentation to resolve common error states. After putting the script into use, our inbound interrupt volume went down, and we had a place to add more debugging utilities as we encountered new problems. And in response to some error states accidentally leaking into production, we integrated i18n-diagnose into our CI runs.
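The general shape of such a tool can be sketched as an ordered list of checks, each of which either passes or returns a finding with a documentation link. This is not the actual i18n-diagnose implementation: the two failure modes, the config keys, and the URLs below are invented for illustration.

```python
from typing import Callable, Dict, List, NamedTuple, Optional


class Finding(NamedTuple):
    summary: str   # what the check found
    docs_url: str  # where to read how to resolve it


Config = Dict[str, object]
Check = Callable[[Config], Optional[Finding]]

DOCS = "https://docs.example.internal/i18n"  # placeholder base URL


def check_default_locale(config: Config) -> Optional[Finding]:
    # Hypothetical failure mode: the application never declares a default locale.
    if not config.get("default_locale"):
        return Finding("no default locale is configured", f"{DOCS}/default-locale")
    return None


def check_translation_files(config: Config) -> Optional[Finding]:
    # Hypothetical failure mode: no translation files are registered.
    if not config.get("translation_files"):
        return Finding("no translation files are registered", f"{DOCS}/translation-files")
    return None


# Ordered roughly by how often each failure mode is seen, most common first.
CHECKS: List[Check] = [check_default_locale, check_translation_files]


def diagnose(config: Config) -> List[Finding]:
    """Run every check in order and collect the findings."""
    findings = []
    for check in CHECKS:
        finding = check(config)
        if finding is not None:
            findings.append(finding)
    return findings


def exit_code(findings: List[Finding]) -> int:
    """Non-zero when anything was found, so the same script can gate a CI run."""
    return 1 if findings else 0
```

Keeping the checks in a plain list means a newly discovered failure mode becomes one more function appended to `CHECKS`, and the same entry point serves both a developer's machine and a CI step.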
Is it perfect? No. In the gnarliest cases there’s still some manual intervention required. But did it empower other teams to solve their own errors and free our ICs up to work on our most important priorities rather than be overburdened with manual triage? Absolutely.
Footnotes
1. If there is value to training on a cadence, recorded training with an in-person Q&A component is a much more sustainable path forward.