hckr.fyi // thoughts

Actionable Metrics to Make Work Visible and Improve Organizational Performance

by Michael Szul on

Back in May, my institution went through a little bit of a reorganization. The end result included a team dedicated to medical education software, and my job was to begin reviewing our team's overall process and workflow to find pockets of improvement, and to begin a minor transformation from an applications and custom software team to one more focused on the business value of our medical education mission.

The difficulty in taking on a new management position is trying to get a proper look at resource utilization in order to project an accurate roadmap of delivery. Although I was a member of the team previously, I wasn't the manager, which meant project tracking was either a side project or held up to different standards.

When I joined the company almost 5 years ago, one of my primary goals was to transform the team and the way we build software into a Lean DevOps team (a fancy way of saying Lean software development with cross-functional teams). Although DevOps was a relatively new term back then, adoption had started to accelerate.

The biggest obstacle any software team runs into is that all estimates are wrong. All of them. We call things "requirements" when what we really mean are ideas or hypotheses. Requirements are things that are required. If requirements change… then I guess they weren't actually required. The problem with Waterfall is that we start with these requirements, move through an elongated process, deliver at the end based on estimates, and then test, secure, and talk to the customers often at the last minute. But so-called "requirements" do change, and customers who aren't in technology don't understand terms like "scope creep," so they likely won't understand why we can't change a request mid-Waterfall stream.

Even as teams move away from Waterfall, this idea of "estimates" remains--mostly as a side effect of the contractor-control model when IT was only a service to be bought, and not a part of the business. But estimates are always wrong. This means that even basic projections and roadmaps are always being shifted around. In an environment where deliverables are constantly changing dates, it's hard to accurately strategize long-term--even if that strategy is just a rough placeholder of initiatives. So why are deliverables constantly changing?

It's hard to answer any question of "why" without data to look at. In software, unfortunately, there is an awful lot of work that is invisible, so in order to make good decisions on project and resource planning, and chart a course for process improvement, we need to make that work visible.

A good starting point is Dominica DeGrandis' excellent Making Work Visible, which details five primary "time thieves" that are responsible for invisible work and inefficiencies:

- Too much work-in-progress (WIP)
- Unknown dependencies
- Unplanned work
- Conflicting priorities
- Neglected work

All of these can have a significant impact on project timelines and resource utilization, but if we're going to cherry-pick the easiest one: What percentage of your daily/weekly/monthly work is spent on unplanned items? If you're working on unplanned items, then you can't be working on planned items, which will obviously impact your overall projections.

I actually started out looking at unplanned work in combination with a few other metrics: expedited work and emergency work (e.g., outages). Although unplanned work and expedited work can be one and the same, you can sometimes have unplanned work that isn't necessarily expedited (e.g., we add it to the Kanban as an item approved to pull, but it's still waiting for someone to pull), and sometimes expedited work isn't necessarily unplanned; it's simply moved up the timeline.

After tracking about 5 months' worth of work items, I ultimately dropped tracking of expedited work and emergency work. Emergency work was already being tracked in a different DevOps metric as part of "time to recover" when an outage occurs, and expedited work didn't diverge enough from unplanned work to warrant the added cognitive cycles.

After 6 months of tracking, 46.2% of our time in that period was spent on unplanned work. In the past 3 months: 54.3%. That means roughly half of all work time was being spent on unplanned items. I'm fairly confident that this has a sizeable impact on project planning, as well as resource utilization.
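As a rough sketch of how that kind of measurement works, all it takes is tagging each work item as planned or unplanned and recording effort. The field names and sample items below are hypothetical, not our actual tracking schema:

```python
from dataclasses import dataclass

@dataclass
class WorkItem:
    title: str
    planned: bool   # False = unplanned (bug, escalation, bypassed intake)
    hours: float    # effort spent on the item

def unplanned_percentage(items):
    """Share of total effort spent on unplanned work, as a percentage."""
    total = sum(i.hours for i in items)
    unplanned = sum(i.hours for i in items if not i.planned)
    return 100.0 * unplanned / total if total else 0.0

# Hypothetical month of tracked items
items = [
    WorkItem("Feature: rotation scheduler", planned=True, hours=40),
    WorkItem("Prod bug: login loop", planned=False, hours=12),
    WorkItem("Tier-1 escalation", planned=False, hours=8),
    WorkItem("Feature: evaluations export", planned=True, hours=20),
]
print(f"{unplanned_percentage(items):.1f}% unplanned")  # 25.0% unplanned
```

Weighting by hours rather than counting tickets keeps one large unplanned item from hiding behind a pile of small planned ones.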

Why so much unplanned work? That's the next question, but it's a question we wouldn't even be asking if we didn't have the metrics first. Unplanned work can be the result of bugs in the software, poor overall project planning (conflicting priorities and frequent pivots), requests deemed important that bypass normal work intake and prioritization channels, reactionary behavior, etc. In our specific situation, tier-1 support requests are handled by a separate team, but our own team was still receiving a large number of support requests, so one of the areas where I narrowed focus was the number of tier-1 support requests that my team was handling: 30%.

Now that isn't to say that the tier-1 support people aren't doing their job. Blame processes, not people, right? I'd rather approach it as: What could we be doing to better enable them?

Although the Agile Manifesto tells us that we should value working software over comprehensive documentation, Lean manufacturing teaches us that your most important customer is your next customer downstream. If you're building services and applications, but another team is handling the first line of defense, they are one of your downstream customers that you need to pay special attention to and enable. Part of that is with appropriate documentation, which really consists of two questions:

One of the cornerstones of service delivery is documentation and training, so it stands to reason that improved documentation and regular exploratory training sessions need to be baked into any perceived hand-off process. What's more: That documentation needs to evolve to modern standards as a searchable, indexable knowledge base.

As IT organizations and departments move through the stages of low, medium, and high performance, those embedded in medium-performing teams often reach transition points where there is a void in the areas that would normally propel them forward. In smaller shops, documentation, training, and testing are usually the items that fall off the table.

If appropriately searchable application and service documentation (as well as troubleshooting guides) is readily available, this should reduce the tier-1 support burden. Furthermore, by treating the support team as a customer downstream, troubleshooting and administrative entry screens are no longer optional. If tier-1 support can be reduced from 30% to 10%, that'll bring the overall unplanned work percentage down to around 30%. Even if there is only a 10% gain, it shouldn't be hard to justify the business value of making gains in these areas.
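The arithmetic behind that projection is straightforward if you treat tier-1 requests as a slice of the unplanned bucket (a simplifying assumption; the figures are the ones from this article):

```python
unplanned_pct = 54.3   # share of all work that is unplanned (past 3 months)
tier1_pct = 30.0       # tier-1 support requests, as a share of all work
tier1_target = 10.0    # goal after better docs and admin screens

# Tier-1 requests sit inside the unplanned bucket, so every percentage
# point of tier-1 work removed comes straight off the unplanned figure.
projected_unplanned = unplanned_pct - (tier1_pct - tier1_target)
print(f"Projected unplanned work: {projected_unplanned:.1f}%")  # 34.3%
```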

Hand-offs are always tricky; meanwhile, project prioritization is usually dependent on outside teams with an equal seat at the table. And you're always likely to get requests that force you to realign your regular software sprints. However, there's another area in the unplanned work category that the applications team can fully control: Bugs.

Nobody wants to release software bugs, but when you're trying to move at a fast pace with changing requirements and a diverse team, bugs are inevitable… but a lot of bugs don't have to be. One of the principles of Lean DevOps is to build quality in. Two ways to build quality in that place the responsibility on the development team are:

- Automated unit testing
- Code coverage thresholds

These are components of a much larger scope that deals with creating an automated DevOps pipeline. I won't go into too much detail about a "pipeline" here, but by creating an automated DevOps pipeline, when developers check in new code, that code can be built, tested, and evaluated automatically before deploying to any environment. When you add automated unit testing and code coverage to your pipeline, you can set up rules so that no build passes or gets deployed without 100% of the automated tests passing and without hitting a certain threshold of code coverage. Let's say 80% of lines.
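As a minimal sketch of that build rule (the function and its parameters are illustrative, not any real CI system's API):

```python
def quality_gate(tests_passed, tests_total, covered_lines, total_lines,
                 coverage_threshold=0.80):
    """Return True only if every test passed and line coverage meets
    the threshold -- the pipeline rule described above."""
    if tests_total == 0:
        return False          # no tests at all: fail the gate
    if tests_passed < tests_total:
        return False          # any failing test blocks the build
    coverage = covered_lines / total_lines
    return coverage >= coverage_threshold

# 120/120 tests green, 850 of 1000 lines covered -> 85% >= 80%, gate opens
print(quality_gate(120, 120, 850, 1000))  # True
# One failing test blocks deployment regardless of coverage
print(quality_gate(119, 120, 990, 1000))  # False
```

In practice this logic usually lives in the pipeline configuration itself rather than hand-rolled code, but the decision it encodes is the same.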

Setting a code coverage threshold does two things:

From a DevOps angle, better code quality means that as deployment frequency (deployments of new features into production) increases, the change fail rate (bugs introduced) stays low, reducing the number of bugs and regression bugs. This frees the testers (if your team has any) to do more exploratory and functional testing. It'll also reduce the number of unplanned work items by eliminating undocumented bugs and hopefully a few fire drills.
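Both of those DevOps measures fall out of a simple deployment log; a minimal sketch, with hypothetical data:

```python
from datetime import date

# Hypothetical deployment log: (deploy date, caused a failure?)
deployments = [
    (date(2019, 9, 2), False),
    (date(2019, 9, 9), True),    # hotfix needed afterwards
    (date(2019, 9, 16), False),
    (date(2019, 9, 23), False),
]

def change_fail_rate(log):
    """Fraction of production deployments that introduced a failure."""
    return sum(1 for _, failed in log if failed) / len(log)

def deployment_frequency(log, days):
    """Deployments per week over the observed window."""
    return len(log) / (days / 7)

print(f"Change fail rate: {change_fail_rate(deployments):.0%}")      # 25%
print(f"Deploys per week: {deployment_frequency(deployments, 28):.1f}")  # 1.0
```

The goal is for the first number to stay flat (or fall) while the second rises.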

In a drive to transform a team from one building custom applications to one focusing on outcome-driven business value, the first thing you need to do is get a handle on what work you're actually doing on a day-to-day basis, and how you can improve the process around that work. In our case, improving code quality (we had automated unit tests; just not enough) to drive up deployment frequency, reduce batch sizes, and drive down change fail rates was paramount to reducing the amount of unplanned work. Meanwhile, improving the documentation and the administrative screens for the support team would reduce our tier-1 support requests. With both these things in focus, we're targeting a 30% reduction in unplanned work over the next 6 months as stage one of our process improvement initiative.

The true beauty of this is that we've really only tackled one of the five time thieves mentioned earlier, so our drive toward process improvement is still in its infancy, as is our transformation into a full Lean DevOps team. There are many more exciting things on the way.