I was recently reminded of a popular XKCD comic:
The core idea is simple – given how often you do a task (totaled over five years) and how much faster you could make it, how much time can you spend optimizing before the tradeoff is no longer worth it?
I think this table has some unintuitive values! For example, if there’s a task you do 50/day (perhaps a git status, for the technically inclined?) – then even if it meant only shaving 1 second off of it, you should spend up to a whole day optimizing it!
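As a quick sanity check on that table entry, here is the five-year arithmetic in Python (a rough sketch, using the numbers from the example above):

```python
# Rough check of the table entry above: a task done 50 times a day,
# shaving 1 second off each time, summed over five years.
times_per_day = 50
seconds_saved_per_run = 1
days = 365 * 5

total_hours_saved = times_per_day * seconds_saved_per_run * days / 3600
print(total_hours_saved)  # ~25 hours, i.e. roughly a full day of optimizing
```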
I’ve been curious to apply this idea to my life even further – over shorter time horizons, or when the optimization might affect many people. I find that many of the best engineers I’ve ever worked with know and work by these tradeoffs intuitively, but I am not one of them :) I default to math and technology to help me get better – on that note, let’s write some simple equations!
Optimizing while Debugging
Software engineers are no strangers to the concept of premature optimization. Especially in startup-land, it should be treated with dread; any time spent unsuccessfully or uselessly optimizing is time not spent on the mission-critical. So, when working to debug an error, it’s hard to guess how much time to spend making the repro faster.
Let’s cast this to a simple equation – the max time you can spend making the repro more efficient is
time_spent_improving < num_times_you_will_run_the_repro * (original_repro_time - new_repro_time)
Here we can see: if we guess we’ll run the repro a dozen times, and it takes 2 minutes but could be brought down to one, we should spend about a quarter of an hour doing so!
Similarly, if a repro takes 100 minutes, and we’ll again do it a dozen times, but think we can bring it down to 60 minutes, we ought to spend a whole workday optimizing!
The calculator assumes that the optimization always succeeds if you put in the break-even hours – this is not always true. Sometimes your optimization idea doesn’t work, or it’s not as fast as you thought it would be. To handle this uncertainty, I just multiply the value by some uncertainty factor (often 0.8 or 0.4, based on how confident I feel).
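Putting this together, here is a minimal sketch of the calculator I have in mind; the function name and the confidence parameter are just my own framing of the formula and uncertainty factor above:

```python
def optimization_budget_minutes(num_runs, old_minutes, new_minutes, confidence=1.0):
    """Max time worth spending on an optimization, scaled by how confident
    you are that it will actually pan out."""
    return num_runs * (old_minutes - new_minutes) * confidence

# The two repro examples from above:
print(optimization_budget_minutes(12, 2, 1))     # 12 minutes, about a quarter hour
print(optimization_budget_minutes(12, 100, 60))  # 480 minutes, a full workday

# Same workday example, but only 80% confident the optimization will land:
print(optimization_budget_minutes(12, 100, 60, confidence=0.8))  # 384 minutes
```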
Making Onboarding Faster
At large tech co’s, we’re onboarding ourselves onto tools and workflows near-constantly. Documentation is often outdated because of something I call the Insider’s Fallacy – once you are onboarded, you’re not stepping on these rakes, and you never will again, so you slowly forget all the times you got hit in the face!
But the tradeoffs here are similar – say it takes the average person about 2 hours to onboard to a tool, you could make it 1 hour with a nice starter script or doc, and 2 people onboard onto the tool each week (let’s say for half a year) – the calculator naively says that you should spend a whole work week optimizing it.
For such instances, I factor in a discount – you can think of this as a selfishness discount, or as a time-value-asymmetry discount (your already-onboarded time is probably a bit more valuable than that of the folks onboarding right now). The number is arbitrary – I typically pick 0.5, but it varies by person and company.
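Spelled out with the illustrative numbers from this section (and the 0.5 discount applied), the arithmetic looks something like this:

```python
# Onboarding example: 2 hours -> 1 hour, 2 people per week for ~26 weeks,
# then a 0.5 "selfishness discount" on top.
people_per_week = 2
weeks = 26
hours_saved_per_person = 2 - 1
discount = 0.5

budget_hours = people_per_week * weeks * hours_saved_per_person
print(budget_hours)             # 52 hours, roughly a naive work week
print(budget_hours * discount)  # 26 hours after the discount
```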
Improving Shared Tools
Improving the quality of life on tools that a large number of people in the company use can be a huge company-wide productivity (and happiness!) boost.
Imagine a git pre-commit hook that takes 5 seconds because of a slow linting check. Making the check faster, or moving it to CI, might bring our commit time down to 1 second. Assuming that thirty people commit at least 10 times a day (over 2 months), there’s reason enough to spend up to 20 hours (or 10 hours, if you apply the selfishness discount) improving this.
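Here is that figure spelled out, as a sketch with the assumed numbers above (roughly sixty days in the two months):

```python
# Pre-commit example: 5s -> 1s per commit, 30 people, 10 commits/day, ~60 days.
people = 30
commits_per_day = 10
seconds_saved_per_commit = 5 - 1
days = 60

budget_hours = people * commits_per_day * seconds_saved_per_commit * days / 3600
print(budget_hours)        # 20.0 hours
print(budget_hours * 0.5)  # 10.0 hours with the selfishness discount
```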
Just like so, if you start looking, you’ll find many processes, used by many people, many times a day, that are slower than they have any right to be.
But What About Concurrency?
This entire article has assumed that users are busy-waiting on these processes. While that’s sometimes true (launch a thing -> flip over to scroll Slack is a time-honored workflow), many engineers develop systems that let them switch workflows on and off much more easily. I am admittedly subpar at this – I struggle to context-shift quickly and cleanly. Regardless, below is some advice from the elders on how to do it well.
Allow Concurrency with Shared Resources
Often, you have two different tasks that need to work on a shared resource (like a code repository). I’ve seen great engineers keep multiple copies of a repo locally, maintain great git aliases, build workflows that seamlessly launch things off to devboxes, or run a bucket-load of tmux shells. They often have many parallel compute resources going simultaneously as well – launching multiple debugging ideas at once to compare and contrast, for example.
Robust and Reliable Observability and Logging
I’ve noted that senior engineers often overindulge in observability in their setups – saving and logging state constantly. It looks excessive at first, but they never have to re-run experiments, and they can flag problems and inefficiencies much more quickly. In true “fire-and-forget” spirit, many of them have notifications that ping them when a job completes or runs over some time limit, so they never have to worry about checking in on jobs.
Disconnected Tasks Co-Exist
I’ve also noted that context shifts are less painful when tasks that require different skills are overlapped – writing up a research report while debugging a node latency issue, or polishing a piece of documentation while trying to get a PR landed on main.
While all of this advice sounds great, applying it means friction, extra thoughtfulness, and slowdowns in your process while you get comfortable – seeing the value can take a while. But I’m told these habits are worth investing in, and it’s on that promise that I’m trying to get better at them!
Tying this back to the calculator: I often scale the original repro time down to just the fraction I won’t be able to overlap with other work – 40 minutes becomes ~20 minutes for me, and probably less for other folks :) Your mileage may vary!