Software Systems Need ‘Skin in the Game’

Key Takeaways:

  • Skin in the game means decision-makers bear the consequences of their decisions, both the risks and the rewards.
  • All systems need this attribute to drive evolutionary pressure.
  • Software development has suffered greatly from a long history of trying to separate consequences from decisions in the name of efficiency.
  • Modern software practices and management systems have reversed the trend and put skin back in the game.
  • On-call engineering is the engineering practice that puts skin in the game, balancing the upside with the downside.

In his famous lectures on physics, Richard Feynman explains that the test of all knowledge in science is the experiment. This is the sole source of scientific truth. But where does the knowledge to experiment on come from? In physics, the thinking process is so difficult that there is a division of labor – theoretical physicists dream up new ideas, and experimental physicists design the experiments to test them.

This division of labor between ‘thinking’ and ‘doing’ might make sense if you are trying to understand the origins of the universe, but here on earth – in the real world – we do not have the luxury of letting others dream up experiments for our own lives. Consequential decisions need to be made by those who pay for the consequences, by the people with skin in the game.

Skin in the game drives evolutionary pressures across a system by ensuring decisions come with both the rewards and the risks. When bad decisions are made in a system designed with skin in the game, selection processes will either alter the decision or eliminate the decision-makers. Remove skin in the game, and evolutionary pressure goes with it – bad decisions can be made without consequences.

In software, where our work is made up entirely of systems building, we have continuously missed this crucial point, lining up countless teams in front of the firing squad of asymmetry.

Throughout the history of software, managers have tried to design methods of engineering that pull everything they can away from the teams writing code. Theoretically, this should make things faster: “With the division of labor we can create a software factory!” But in reality, this creates a system with artificial incentives (e.g. lines of code, test coverage) that have nothing to do with the value being delivered to customers.

There has been a compulsion to think of only “the software” as “the system”, and to assume that once decisions (in the form of code) are put into the system, they take on a life of their own and can exist independently of the decision-makers.

But software systems are sociotechnical and our decisions do not detach from the decision-makers, only the consequences do. The fitness of the system needs a connection to consequences. This is what we have been struggling to figure out in our methodology for building software.

Systems of Software Engineering

Software is made through the accumulation of a great many small decisions from different people over a period of time. We write those decisions into code, but they are based on imperfect information and need to be updated as we learn. If learning feedback loops are in place then the good decisions will stick around, the bad ones will get replaced, and the system will evolve.

The learning comes from all over, not just from writing code but also from designing, planning, testing, monitoring, securing, operating, maintaining, demonstrating, and using the system.

The great folly of software engineering is to think you can anticipate what will be learned in those other activities without having to participate in them, because that is where the risk of our decisions is uncovered, sometimes quite dramatically.

Systems of engineering that move those activities away from developers remove the connection to the risk they are creating, which destroys evolutionary pressure and generates software systems that are fragile, expensive, and difficult to maintain.

We learned many hard lessons, and they run up and down the development stack. To name a few obvious ones:

  • It’s why we do DevOps. “You build it, you run it” – Werner Vogels, CTO of Amazon, saw this clearly: if you don’t run the code you write, you have no skin in the game. When one group does the building and another group does the running, you get the pernicious separation of thinking and doing, the division of decision-making and risk-bearing.
  • It’s why we killed QA teams. When one team is writing code and the other team is figuring out how to test it, you remove the direct contact between how it works and how it fails. Outsourcing all the “testing” does the exact opposite of what is intended: it creates bad software by removing the evolutionary pressure needed to stop bad code from entering the system. 
  • It’s why we build products, not projects. Projects have one group doing the planning, and another group doing the work. But it’s easy to set a date when you don’t have to figure out how to meet it. Ironically, separating the planning from the “doing” removes predictability from the system. Developers end up burning out or shipping bad systems to meet deadlines dreamed up with no skin in the game.

And we are still learning many hard lessons:

  • Why is security still so poor in software? Because it’s mostly compliance frameworks created by a checklist mafia with zero skin in the game. Cringe hard when you hear that security is now a C-level problem. What a terrible place to put the problem: in the hands furthest away from the people doing the doing.

But overall, the industry trend is to create systems of engineering that put skin back in the game. Over the past two decades, management systems and technical practices have emerged that do just that: full-stack teams that start the work together and take collective ownership of outcomes, and practices like Agile, DevOps, Continuous Integration, and Continuous Delivery that combine to remove encumbrances on our ability to own the risks we create with our systems.

The important question to ask is: how can we maximize skin in the game? How can we give developers the most direct contact with the risks they create? There is one practice that provides exactly that contact, from a privileged position that creates a form of skin in the game unparalleled in the business world.

The Practice of On-call Engineering

Going “on-call” means that, on a recurring cadence, you put down your regular work and spend a week working directly on the system. This means two things:

  1. Carrying a pager, responding to issues in real time as they happen, and pulling in others if something serious is happening.
  2. Doing the work needed to maintain the system: tuning monitors and alerts, working on recurring tasks, or making small improvements to optimize the on-call experience.

For a system where developers go on-call, the connection to the risk they create could not be greater. If monitors generate a lot of unreasonable or unactionable alerts, if components are loaded up with manual tasks and un-automatable issues, if the system is complicated and difficult to debug, it is the people who created those problems who must suffer.
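
To make that concrete, here is a minimal sketch in Python of the kind of alert tuning an on-call team does for itself. The names and thresholds are entirely hypothetical and not drawn from any real alerting tool; the sketch only illustrates the principle that a human should be paged solely for sustained, actionable, customer-impacting problems.

  from dataclasses import dataclass
  from typing import Optional

  @dataclass
  class Alert:
      name: str
      error_rate: float           # fraction of failing requests over the window
      duration_minutes: int       # how long the condition has persisted
      runbook_url: Optional[str]  # an alert nobody can act on should be tuned or deleted

  def should_page(alert: Alert) -> bool:
      """Page a human only for sustained, actionable, customer-impacting problems."""
      if alert.runbook_url is None:
          return False  # not actionable: fix the alert, don't wake someone up
      if alert.error_rate < 0.05:
          return False  # below the (hypothetical) customer-impact threshold
      return alert.duration_minutes >= 10  # ignore brief, self-healing blips

  # A noisy, unactionable alert: tune it instead of paging on it.
  print(should_page(Alert("checkout-5xx", error_rate=0.02, duration_minutes=3, runbook_url=None)))  # False

The specific thresholds do not matter; what matters is that the people who will be woken up are the ones who get to set them.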

On-call engineering creates skin in the game in a way that deeply shapes our understanding of the world.

“What matters isn’t what a person has or doesn’t have; it is what he or she is afraid of losing.”

Nassim Nicholas Taleb

An incomplete list of things that one loses when one goes on-call for a hard-to-operate system (in no particular order): sleep, weekends, evenings, time working on things one enjoys, and general well-being. To an on-call team, the idea of putting these things at risk is morally repugnant.

And so, on-call engineering creates a powerful set of engineering ethics, a moral code of conduct that stems from overcoming adversity and the constant threat of ruin. Taking on risk wantonly or haphazardly is deeply offensive, in a way that can be hard to understand for people who have never gone on call.

Get Your Hands Dirty

There are different ways to create skin in the game, but by far the most valuable is learning through experience: time spent in the hot seat.

Until you have actually experienced going on-call for a software system, it is hard to imagine just how deeply it shapes your decision-making. Skin in the game therefore implies that we need to come down from our ivory towers and experience the world:

“The knowledge we get by tinkering, via trial and error, experience, and the workings of time, in other words, contact with the earth, is vastly superior to that obtained through reasoning.”

Taleb, Skin in the Game, p.7

If you want to make decisions about a product’s design or technology or architecture or technical debt or timelines or backlog or security, then join that team. If you can’t, you can offer your opinion, but you don’t get a vote. Think Pigs and Chickens, an excellent fable that was unfortunately removed from the Scrum Guide ten years ago, with nothing to replace the critical lesson it provides.

For software developers who want to build great systems, owning risk is an inescapable moral obligation. They implicitly know that the most powerful force in systems-building is the evolutionary pressure brought by a direct connection to consequences. When we acknowledge this, we can use that force to drive the evolution of our systems and generate the kind of value that customers pay for and engineers love.


Notes: 
- This article was originally published in InfoQ
- The art featured in this post is "The Battle of Britain" by Paul Nash, depicting a dog fight in 1941. Fighter pilots have the ultimate skin in the game. 

Published by JohnRauser

Eng Leader @ Cisco
