Software Systems need ‘Skin in the Game’

Key Takeaways:

  • Skin in the game means decision-makers bear the consequences of their decisions, both the risks and the rewards.
  • All systems need skin in the game to drive evolutionary pressure.
  • Software development has suffered greatly from a long history of trying to separate consequences from decisions in the name of efficiency.
  • Modern software practices and management systems have reversed the trend and put skin back in the game.
  • On-call engineering is the quintessential engineering practice to put skin back in the game and balance the upside with the downside.

In science, the test of all knowledge is the experiment. This is the sole source of scientific truth. But where does the knowledge to experiment on come from? In physics, the thinking process is so difficult that there is a division of labor – theoretical physicists dream up new ideas, and experimental physicists design the experiments to test them.

This division of labor between ‘thinking’ and ‘doing’ might make sense if you are trying to understand the origins of the universe, but here on earth – in the real world – we do not have the luxury of letting others dream up experiments for our own lives. Consequential decisions need to be made by those who pay for the consequences, by the people with skin in the game.

Skin in the game drives evolutionary pressures across a system by ensuring decisions come with both the rewards and the risks. When bad decisions are made in a system designed with skin in the game, selection processes will either alter the decision or eliminate the decision-makers. Remove skin in the game, and evolutionary pressure goes with it – bad decisions can be made without consequences.

In software, where our work is made up entirely of systems building, we have continuously missed this crucial point, lining up countless teams in front of the firing squad of asymmetry.

Throughout the history of software, managers have tried to design methods of engineering that pull everything they can away from the teams writing code. Theoretically this should make things faster: “With the division of labor we can create a software factory!” But in reality, this creates a system with artificial incentives (e.g. lines of code, test coverage) that have nothing to do with the value being delivered to customers.

There has been a compulsion to think of only “the software” as “the system”, and to believe that once decisions (in the form of code) are put into the system, they take on a life of their own and can exist independently of the decision-makers.

But software systems are sociotechnical and our decisions do not detach from the decision-makers, only the consequences do. The fitness of the system needs a connection to consequences. This is what we have been struggling to figure out in our methodology for building software.

Systems of Software Engineering

Software is made through the accumulation of a great many small decisions from different people over a period of time. We write those decisions into code, but they are based on imperfect information and need to be updated as we learn. If learning feedback loops are in place then the good decisions will stick around, the bad ones will get replaced, and the system will evolve.

The learning comes from all over – not just from writing code but also from designing, planning, testing, monitoring, securing, operating, maintaining, demonstrating, and using the system.

The great folly of software engineering is to think you can anticipate what will be learned in those other activities without having to participate in them, because that is where the risk of our decisions is uncovered, sometimes quite dramatically.

Systems of engineering that move those activities away from developers remove the connection to the risk they are creating, which destroys evolutionary pressure and generates software systems that are fragile, expensive, and difficult to maintain.

We have learned many hard lessons, and they run up and down the development stack. To name a few obvious ones:

  • It’s why we do DevOps. “You build it, you run it” – Werner Vogels, CTO of Amazon, saw this clearly: if you don’t run the code you write, you have no skin in the game. When one group does the building and another group does the running, you get the pernicious separation of thinking and doing, the division of decision-making and risk-bearing.
  • It’s why we killed QA teams. When one team is writing code and the other team is figuring out how to test it, you remove the direct contact between how it works and how it fails. Outsourcing all the “testing” does the exact opposite of what is intended: it creates bad software by removing the evolutionary pressure needed to stop bad code from entering the system. 
  • It’s why we build products, not projects. Projects have one group doing the planning and another group doing the work. But it’s easy to set a date when you don’t have to figure out how to meet it. Ironically, separating the planning from the “doing” removes predictability from the system. Developers end up burning out or shipping bad systems to meet deadlines dreamed up with no skin in the game.

And we are still learning many hard lessons:

  • Why is security still so poor in software? Because it’s mostly compliance frameworks created by a checklist mafia with zero skin in the game. Cringe hard when you hear about how security is now a “C-level problem.” What a terrible place to put the problem: in the hands furthest away from the people doing the doing.

But overall the industry trend is to create systems of engineering that put skin back in the game. Over the past two decades, synthetic management systems and technical practices have emerged that do just that. Full-stack teams that start together with collective ownership of outcomes. Practices like Agile, DevOps, Continuous Integration and Continuous Delivery work together to remove encumbrances on the ability to own the risks that we create with our systems.

The important question to ask is: how can we maximize skin in the game? How can we give developers the most direct contact with the risks they create? There is one practice that provides such contact – a practice with a privileged position that creates a unique form of skin in the game, unparalleled in the business world.

The Practice of On-call Engineering

Going “on-call” means that on some regularly-occurring cadence, you put down your regular work and spend a week working directly on the system. This means two things:

  1. Carrying a pager, responding to issues in real time as they happen, and pulling in others if something serious is happening.
  2. Doing the work needed to maintain the system, tuning monitors and alerts, working on regularly occurring tasks, or making small improvements to optimize the on-call experience. 

For a system where developers go on-call, the connection to the risk they create could not be greater. If monitors generate a flood of unreasonable or unactionable alerts, if components are loaded up with manual tasks and un-automatable issues, if the system is complicated and difficult to debug, it is the people who created those problems who must suffer.

On-call engineering creates skin in the game in a way that deeply shapes our understanding of the world.

“What matters isn’t what a person has or doesn’t have; it is what he or she is afraid of losing.”

Nassim Nicholas Taleb

An incomplete list of things that one loses when one goes on-call for a hard-to-operate system (in no particular order): sleep, weekends, evenings, time working on things they enjoy, and general well-being. To an on-call team, the idea of putting these things at risk is morally repugnant.

And so, on-call engineering creates a powerful set of engineering ethics, a moral code of conduct that stems from overcoming adversity and the constant threat of ruin. Taking on risk wantonly or haphazardly is deeply offensive, in a way that can be hard to understand for people who have never gone on call.

Get Your Hands Dirty

There are different attempts to create skin in the game, but by far the most valuable asset is learning through experience. Time spent in the hot seat.

Until you have actually experienced the impact of going on-call for a software system, it is hard to imagine just how deeply it affects your decision-making. Skin in the game therefore implies that we need to come down from our ivory towers and experience the world:

“The knowledge we get by tinkering, via trial and error, experience, and the workings of time, in other words, contact with the earth, is vastly superior to that obtained through reasoning.”

Taleb, Skin in the Game, p.7

If you want to make decisions about a product’s design or technology or architecture or technical debt or timelines or backlog or security, then join that team. If you can’t, you can offer your opinion, but you don’t get a vote. Think Pigs and Chickens, an excellent fable that was unfortunately removed from the Scrum Guide ten years ago, with nothing to replace the critical lesson it provides.

For software developers who want to build great systems, owning risk is an inescapable moral obligation. They implicitly know that the most powerful force in systems-building is the evolutionary pressure brought by a direct connection to consequences. When we acknowledge this, we can use that force to drive the evolution of our systems and generate the kind of value that customers pay for, and engineers love.

- This article was originally published in InfoQ
- The art featured in this post is "The Battle of Britain" by Paul Nash, depicting a dog fight in 1941. Fighter pilots have the ultimate skin in the game. 

Software Development and the Psychology of Money

The Psychology of Money: Timeless Lessons on Wealth, Greed, and Happiness, by Morgan Housel.

I picked up The Psychology of Money because it’s light, had pretty good reviews and looked like a quick read. All of these promises came true, but I didn’t expect to apply the lessons to software!

The book mainly revolves around two things that are hard to integrate into our everyday thinking – 1) compounding and 2) tails.

The first comes down to this: all growth is driven by compounding, which always takes time. The impatient among us will never succeed at the long game of growth, only at the short game of gambling. And it’s not worth gambling. Why? That’s the second part.

The second pertains to risk and ruin (a topic on which the author borrows heavily from Nassim Nicholas Taleb – in other words, it’s a good book ;). The essence is that all long games are subject to tail events – rare, catastrophic events that mean ruin for the unprepared – and they will happen. Unfortunately, we often have a very bad idea of where we actually sit inside those probabilities (or we just ignore them).

What does this have to do with software?

Firstly, we are all looking for growth. Whether it is in the products we are building, or in our careers, networks, knowledge, etc. The principle of compounding applies here:

  • In the two-second lean tasks you do to make your product better. It pays to take the time to make improvements, remove that technical debt, document those decisions, refactor your old code. You won’t get much today, but the benefits compound over time as you and your team make a habit of small improvements.
  • In the daily tasks you do to learn and grow. Everyone will agree that it’s easier to spend an hour on Netflix than to take the time to learn a new thing, but an hour spent is an hour gone forever. The dividends paid on an hour learned are huge in the long game. You are presented with this decision every single day. It’s yours and yours alone to make.
  • In the conversations you have with your teammates, your peers, your manager, your manager’s manager, your industry contacts, and so on. You need to make an investment in these. Again, the payoff might not happen today (in fact, it may be somewhat uncomfortable today, since you have to figure out what to talk about!). But you should see these as long-term investments that will be worth it when you run into problems down the road. Speaking of which:
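The arithmetic behind the first lesson is worth seeing once. Here is a minimal sketch of compounding, where the 1%-per-day figure is purely illustrative (it is not a number from the book):

```python
# Compounding: a small, steady improvement rate dominates over a long horizon.
def compound(rate_per_period: float, periods: int) -> float:
    """Total growth factor after `periods` of compounding at `rate_per_period`."""
    return (1 + rate_per_period) ** periods

# A 1% improvement per working day, over roughly a year of working days:
print(f"{compound(0.01, 250):.1f}x")  # → 12.0x, roughly twelvefold in a year
# After only a month the same habit is barely visible:
print(f"{compound(0.01, 20):.2f}x")   # → 1.22x
```

The point of the sketch is the asymmetry between the two printed numbers: compounding looks like nothing in the short game, which is exactly why the impatient abandon it.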

Secondly, on ruin. There are all kinds of disasters that we need to protect ourselves from in software: late integrations, big bangs, building the wrong thing, outages. But there are also a variety of different practices that we can use to buy insurance: thin slicing, writing tests, incremental development, monitor driven development, and so on. When you look at these practices as insurance against ruin, and you understand a bit about probability and risk, it’s easy to make the right decision to use them.
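The tail-risk arithmetic can be sketched the same way. The per-release disaster probability below is invented for illustration; the lesson is how repeated exposure turns a rare event into a near-certain one:

```python
# Ruin compounds too: a small tail risk becomes large under repeated exposure.
def chance_of_ruin(p_tail: float, exposures: int) -> float:
    """Probability of at least one tail event across independent exposures."""
    return 1 - (1 - p_tail) ** exposures

# A 1% chance of disaster per release feels negligible, but over 100 releases:
print(f"{chance_of_ruin(0.01, 100):.0%}")   # → 63%
# Insurance-like practices (tests, thin slices, monitoring) work by shrinking p_tail:
print(f"{chance_of_ruin(0.001, 100):.0%}")  # → 10%
```

Seen this way, practices like incremental development are not overhead; they are premiums paid to move the second number toward zero.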

The psychology of “money” is not really about money at all. It’s about taking a realistic approach to the way things actually work in life. And so it takes a lot of the mystery out of why things work the way they do, and the lessons can be applied to a lot of areas of life. These are lessons that pass the test of time, and that we can use in our every day decisions.

Pointing and Calling

A small leadership lesson.

Tokyo train systems are legendary for running on time. How do they do it? A key element is pointing and calling – when taking an action, the operator points and calls it out so that everyone knows what is happening.

It may seem like drawing attention to simple things that are already obvious, but that’s exactly the point – simple things can cause really big problems when they fail. In a world where everyone cannot see everything, pointing and calling can provide a shared set of eyes.

The idea is to get in the habit of making sure that the obvious is actually obvious … by starting with simple things. It takes one second, it doesn’t hurt anyone, but it can make a world of difference.

And it’s an important lesson for us in software.

We often have to make decisions that range from very simple to very complicated. And since our work is cognitive, it’s easy for the “obvious” to get lost. What you think is obvious to others can easily turn out not to be.

It pays to remember that it’s worth over-communicating in cognitive work such as software delivery. Similar to the “I intend to … “ practice from Turn The Ship Around, we need to be sure we are being clear about what we are doing, drawing attention to our actions so we can get feedback from others. This is a key element of being responsible with our autonomy.

Software teams already practice versions of pointing and calling. For example, we use ceremonies to draw attention to what we are doing, and in those ceremonies we use various practices to make our work visible. This is how we have learned to cope in a world where we can’t see anything (because it’s all in our heads).

And it can be as simple as saying “I’m moving this card to done” as you are transitioning a card in standup, then a second of silence to give someone the opportunity to jump in, “Well, actually ..”

Sometimes this practice is not easy to do, though:

  • Sometimes we point and call, but don’t actually execute. When this happens, people will have unmet expectations.
  • Sometimes we execute, but forget to point and call. Now we have lost the opportunity to get help, to make sure that we actually made the right decision.

We should approach such situations blamelessly. We are allowed to get these things wrong as we collectively learn a new skill. It is better to have tried and failed than not to have tried at all. Most of our decisions (or indecisions) can be reversed – unlike in the train system.

Get more out of your checklists: liminal states in software delivery

When we spend time focussing on specific problems, it comes at the cost of the ability to focus on others. The mind becomes conditioned to make certain things invisible, no matter how many times we look them over.

In the case of this article, it might be a spelling mistake, but in the case of a product it can be all kinds of things – bugs, designs, even market fit.

But we can learn to see again. There are ways to escape the myopia, and one is the power of liminal thinking.

Rites of Passage

Transitions are an important part of human society. The “rite of passage” shows up in many forms across many different cultures, as people move from one stage of life into the next. They signify a change of status, and are said to have three phases:

  1. Pre-liminality, a “breaking away” from the past. The metaphorical death of the previous status.
  2. Liminality, where the subject is dislocated and temporarily lives outside the normal environment. Fundamental questions about knowledge and existence are brought to the foreground.
  3. Post-liminality, where the subject is re-incorporated into the environment with a new status.

In the liminal period, we enter a state where we are leaving something behind, but not yet fully in something else. We are able to look outside of the context of our normal day to day experience, entering a period of scrutiny where we re-examine our reality.

After years building and releasing software products, I am noticing how liminality plays out in the process of delivering our software. It’s a useful mental model that can help us improve our systems and experiences.

The Event

Our systems enter into liminal periods through events.

Scrum is designed to continuously evoke transitional thinking through the use of various events – sprints, retrospectives, refinements. During these events, we use liminality to remember forgotten tasks, find the issues left languishing at the bottom of the backlog, and see the mistakes that should have jumped off the page.

There are also larger events as a product passes through lifecycle phases, like the beginning of early field trials, or a large public launch. As we approach these events we bring in the managers, pull out the checklists, and probably have an assortment of walkthroughs, reviews and check-ins.

Those product lifecycle transitions are great opportunities to take advantage of liminality. We can do much more than just reserve a time to review a specific set of work – testing, monitoring, security, etc.

The Liminal Space

What leadership needs to focus on is creating a liminal space for teams to inhabit.

We can use structure and semiotics to amplify the state of mind. For example, we can use bookend meetings to signal that we are entering and exiting the liminal space.

That way, when we go through the process of reviewing things, we will do more than just check boxes: we will question the intention and existence of those check boxes. We will find new ways to extend their intention and existence into our continuous processes.

Note that we may even feel a sense of guilt or shame as we engage in this process. “How could we have overlooked this? How could we have not noticed that? How did we collectively forget to do something so important?”

Fear not! Foregrounding the invisible is the goal. Software is synthetic, so we should never expect to know everything, and always be ready to uncover the unknown and unexpected.

Be careful not to punish yourself or others for what you may find, as it will set back the entire process. Proceed into the liminal period without judgment. Create a safe, blame-free space to reflect. And give yourself some credit – it takes courage to go boldly into the liminal space.

Liminal Thinking

The anthropological study of liminality tells us that it is a special and privileged time. If we recognize this, we can harness that energy to do more than just scrutinize the work: we can also revisit the central values and axioms that guided its creation.

That is what we should strive to do in this period: inspire liminal thinking. At the heart of the process is questioning the very existence of the feature itself. “Someone remind me, why did we build this?” This is a time of renewal, refresh and rebirth.

So my suggestion is this: use this time to get creative and imaginative.

Don’t ignore the opportunity for liminal thinking. Don’t just go through the motions and check the boxes. Do the reviews and the retrospectives and the rethinking, knowing you have the capability to see your system in a different way in this time.

Discover what you have unconsciously learned to unsee, question everything, and feed this back into your product and your experiences.

Notes on a Learning Organization

Building software products means coping with complexity. Our products are highly interconnected systems of systems. Dynamics are difficult to model; outcomes can be difficult to predict.

Ivory towers crumble on this unstable ground. It is not sufficient to have one person deciding for the whole group, everyone following the direction of a “grand strategist”. Decision-making in complex environments needs to be decentralized.

If we are all to take part in decision-making, then we all need to constantly improve our knowledge of the problems we face – we need to be active participants in a learning organization.

Learning is both an individual and a group experience. Organizations that excel tap the commitment to learn from people at all levels. And learning is something deeper than just taking in information. It is about changing ourselves – growing in meaningful ways that contribute to the whole. So we want to build an organizational culture that promotes a deep connection to collective growth.

I have found that the five disciplines of learning from the book The Fifth Discipline help to inspire ideas about how to develop this culture. A quick summary:

The Fifth Discipline: The Art and Practice of the Learning Organization, by Peter M. Senge.
  1. Systems thinking – seeing how all of our work connects, sharing the big picture. Proactively developing a collective understanding that we can use at all levels to spot patterns and persistent forms. Overcoming when systems develop built-in limitations to growth. (Remember, systems can be technical, organizational, managerial, etc.)
  2. Personal Mastery – developing our ability to see our reality more clearly. Our work is creative work, and creativity results from the movement of ideas. “Still water becomes stagnant.”
  3. Mental Models – guiding our ability to make decisions. How are we building them? Sharing them? Reinforcing them? They have a natural entropy and tendency to diverge, so they must be actively maintained and continuously refreshed.
  4. Shared Vision – having a shared vision enables co-creation. When we have that vision present in our day to day work, we are building things together for a common goal. This helps frame the context of the learning we engage in.
  5. Team Learning – this involves two practices: discourse and discussion. Discourse is the practice of presenting knowledge to others. Discussion is the practice of inquiry and exploration into the discourse. Both are crucial. 

To build a learning org, we can start by asking the questions in these disciplines to enrich our learning experience. Questions like: How can we help other teams expand their understanding of how our work connects with theirs? How can we share our mental models with others? What activities can we do to reinforce our shared values? What kinds of “discourse” activities can we use to promote discussions?

We can start doing that today. Start by creating the Tiny Habits needed to ask these questions at the right times. We should not expect anyone to build our learning organization for us – we all need to take ownership and do our part. What are you doing to help?

Are You Telling the Story of Your Software?

This is Minard’s analysis of the fall of Napoleon’s Grand Army during the French invasion of Russia in 1812, considered a masterpiece of analytic design:


It features in my favourite book by Edward Tufte, “Beautiful Evidence”, as the focal piece for explaining the principles of data presentation. It is a landmark in the history of the ‘infographic’, renowned for its cleverness (in stark contrast to the military disaster it describes).

With not much more than a glance, we can learn a great deal. We immediately see the French Army was decimated – that the approach cost many lives, that the attack on Moscow was ruinous. We see that the timing for the return to France could not have been worse, the sinking mercury being the death of many. We note that river crossings were particularly dangerous, each costing a great many thousand lives.

But something is missing.  

We are missing why? Why was Napoleon attacking Russia? Why was he doing this just before the onset of Winter? And why did all these people agree to such a poorly thought-out plan?

From Minard’s analysis we know “what” happened, but we do not understand the “why” behind any of it.

In software, we spend a lot of time working in the analytic “thinking space”, a place where we are taking things apart and trying to figure out how they work. It’s a safe space, because if you do it well, you will probably be able to figure out most of what is happening. But does this help you understand why it is happening? Does this help you tell the story of your software?

What story are you telling?

To paraphrase the great Ackoff, a systematic analysis of the parts generates knowledge, but it does not generate understanding. It does not explain why things are the way they are. To create understanding we need to look not at the parts, but rather what they are a part of.

We call this the “synthetic” thinking space, a concept that is ages-old, but which gained popularity with mid-century business thinkers like Ackoff, Deming and Drucker. When we work using synthetic thinking we want to play with the pieces, we put them together and look at them in their context, we come up with ideas by experimenting and observing their interactions.

What do I mean? Let’s look at an example:

(source – me … pre-covid)

This is a user story mapping exercise. We are playing with the pieces of a story that we deconstructed earlier. This exercise is designed to evoke synthetic thinking using a diverse group of specialists (plus a lot of La Croix and Red Bull). It is part of a repertoire of activities that enable us to come up with new understandings: new reasons why. 

Why do customers want us to solve their problems? Why do they want to use our software? Why are they willing to pay us money for it?

To answer these questions we experiment, we try things out, see what works, and fail constantly until we get it right. We use processes that are designed to help us put pieces together, and in ways that reveal the unique, the unexpected, and the innovative. We use synthetic thinking.

How are you telling your story?

It is not just a coincidence that we use the concept of a “story” to capture our work in software.  Since the beginning of human history, stories have been used to connect the dots, to bring people together, to generate knowledge and understanding about the world around us.

Stories connect the analytic with the synthetic: Analytic thinking deconstructs the problem, creating knowledge; Synthetic thinking puts the problem back together again, creating understanding. It’s through cycles of analysis and synthesis that we can change the world. 

For all forms of work, we need to ask ourselves: where should we spend the most time in these cycles?

What should we do more of or less of to drive these cycles? What story are you trying to tell?

If it’s a software story, you should be spending much of your time in the synthetic space, and using practices that support it.

That’s because software is synthetic, and for software development we have been learning to prioritize this way of thinking. It is the reason why certain practices succeed in our work. It is the unstated undercurrent that runs beneath many of our successful practices: synthetic work requires synthetic management.

Don’t Mix the Paint! Primitives and Composites in the World of Software

This article was originally published in InfoQ on August 16th.

What color do you think of when you hear the word “red”?

Ask 100 people, they will give you 100 different answers. Even with an anchor to help—a can of Coke, perhaps—there will be differences.

So begins The Interaction of Color by Josef Albers, where he uses various color studies to show the complexity of their interactions. He notes that even professional artists are surprised when presented with his examples, which indicates just how fickle the human mind is at interpreting color. 

Interaction of Color: New Complete Edition, by Josef Albers.

To train the eye, he has learners run experiments that demonstrate concepts like intensity, brightness, reversal, transparency, addition, subtraction, mixture, and so on. In doing these experiments, students work through various scenarios, manipulating color combinations to reveal their interactions.

It is interesting to note that Albers does not use pigment or paint for such experiments. Instead, he uses paper cut-outs. These provide the most reliable way to test scenarios repeatedly. The printed colors are fixed and indivisible.

Indivisible elements are critical for experimentation because they are irreducible. They create a high degree of reliability, which is needed to work with tests and compare results.

In software, we also engage heavily in experiments, and we also need indivisibles to work with. We call these “primitives,” and they come in two types—dynamic and semantic.

Primitives of Dynamics

Software is built from layer upon layer of abstractions. Machine language is abstracted using microcode primitives, microcode is presented as higher-level languages, and so on. At each level, simple elements are presented that abstract away the complexity of the operations underneath. 

Modern software development primarily involves synthesizing a variety of external elements: open source software, third party services, infrastructure APIs, and so on. We bring these together using code to create our systems. It is a world of composability, and our systems are mixtures of modules glued together with code.

Ideally, we would like to have elements that are designed to be used as primitives. Read the literature going back 50+ years, and you find the same good architectural advice: practice modular design—create primitive building blocks by strictly separating implementation from interface. “Don’t mix the paint!” 
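In code, “don’t mix the paint” is the familiar discipline of programming against an interface rather than an implementation. A minimal sketch of the idea (the `Storage` name and its methods are hypothetical, chosen only for illustration):

```python
from typing import Protocol

class Storage(Protocol):
    """The interface: the only surface callers are allowed to depend on."""
    def put(self, key: str, value: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...

class InMemoryStorage:
    """One interchangeable implementation; a disk- or cloud-backed
    implementation would fit the same slot without callers changing."""
    def __init__(self) -> None:
        self._data: dict[str, bytes] = {}
    def put(self, key: str, value: bytes) -> None:
        self._data[key] = value
    def get(self, key: str) -> bytes:
        return self._data[key]

def archive(store: Storage, key: str, payload: bytes) -> None:
    # Composition happens here: glue code assembles primitives together
    # without knowing anything about what sits behind the interface.
    store.put(key, payload)

store = InMemoryStorage()
archive(store, "report", b"quarterly numbers")
print(store.get("report"))  # → b'quarterly numbers'
```

The implementation stays an unmixed “cut-out”: you can lift it out and drop in another without re-mixing the rest of the system.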

This is the logic that drove the development of AWS—offer customers a collection of primitives so they can pick and choose their preferred way to engage with the services, not a single framework that forces you into a specific way of working, which includes everything and the kitchen sink. 


Of course in practice, at scale, it’s not that easy. See Hyrum’s Law, which says, with a sufficient number of users, all observable behaviors will be depended on by somebody. In other words, there is no such thing as a private implementation. We want to pretend that everything underneath the interface can be hidden away, but it’s not really the case. 
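A toy illustration of Hyrum’s Law (all names invented): the documented contract promises nothing about ordering, but the implementation happens to sort, and a client quietly comes to depend on that observable behavior.

```python
def get_tags(user_db: dict[str, list[str]], user: str) -> list[str]:
    """Contract: returns the user's tags. Ordering is never promised."""
    return sorted(user_db[user])  # sorting is an implementation detail

db = {"alice": ["ops", "dev"]}

# Somewhere downstream, a client observes the sorted order and relies on it.
# Swap `sorted(...)` for an unordered implementation and this silently breaks:
first = get_tags(db, "alice")[0]
print(first)  # → dev
```

With enough users, every such accidental behavior becomes load-bearing, which is why the interface alone never tells the whole story.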

If only we could rely on our interfaces like the artist relies on the laws of chemistry and physics. But the things we build with are much less dependable. Accidents can happen many layers underneath our work that result in massive change all the way up the stack (see Spectre or Meltdown).

Implementations also need to change over time, as we learn about our systems, their users, and how the users use them. What we are implementing are ideas about how to do “something,” and these ideas can and should change over time. And here we come to our second type of primitive.

Primitives of Semantics

Every software system is a solution to a problem. But if we start with assumptions about the solution rather than clear statements of problems, we may never figure out the best use of time and resources to provide value to customers.

Even if we had a crystal ball and knew exactly how to solve our users’ problems, it would still not be enough. We also have to know how to get there—a map of the increments of value to be delivered along the way. We need to find stepping stones to a final product, and each stone must align to something a customer wants. 

How do we do this? Again, we try to work from indivisible elements. I like to call these “semantic primitives.” We want these, our raw materials, to be discrete and independently evaluable. Again, “don’t mix the paint!”

These are implemented in various ways. The word “requirements” gets a lot of hate these days. “User stories” are popular, but “use cases” have fallen out of fashion. After a blog post on Medium, “jobs-to-be-done” became “the framework of customer needs” seemingly overnight. 

Regardless of how you conceive them, the purpose is the same: to serve as building blocks for understanding the problem we want to solve and to help us be creative as we move along our product journey. 

When starting with a set of semantic primitives, we can learn from one, make mistakes with another, fall over a third, pivot between the rest, and so on. In theory, they allow the development process to become changeable and continuously aligned to delivering incremental value to customers.

But again, in practice, they are challenging to work with. These are not exhaustive proofs or laws of mechanics. They are assumptions and estimations, usually based on poorly sampled probabilities and questionable causality, crafted loosely with a lot of grey language. And they have to be, because our understanding of our customers and their problems is necessarily incomplete.

Synthetic Systems

Let’s go back to the difference between “red” on paper and “red” in your mind. On paper, the color is stable: it is factual and replicable. But in your mind, the color is unstable and ambiguous. This is the world of objects vs. the world of our minds.

In software, we don’t have the luxury of primitives with stable dynamics like those found in the world of objects. Our systems are synthetic, made up of cognitive composites that are subject to change without notice. We work only inside the world of our minds.

The system dynamics we observe today may not be the same we observe tomorrow. The semantics that we wrote down today may no longer be valid tomorrow. We live in a world of constant uncertainty and emergent knowledge.

To work with such uncertainty, we need to adopt a corresponding mindset.

A big part of the journey in software is learning to suspend the heuristics and mental models that we rely on when interacting with the world of objects. Getting burned by misplaced trust or untested assumptions is part of the evolution from junior to senior.

So we learn to think differently, we learn to challenge everything we see. But is this enough?

It’s worse than we thought!

In his book, Thinking, Fast and Slow, Daniel Kahneman talks at length about the ease with which we fool ourselves into believing in improbable outcomes, and how we are particularly susceptible to illusions of cognition.


Take the example of The Dress. After learning about this visual illusion we easily adjust our understanding. Once we know the truth, we correctly say that the color is black and blue, even though our eyes may still deceive us. We do this by consciously doubting what is presented. 

But when it comes to illusions of cognition, it’s a different story. Evidence that invalidates ingrained thinking is not easily accepted. Look to the Dunning-Kruger effect, or even anti-vaxxers and flat-earthers for some extreme examples. This does not bode well for us.

Just like the professional artists who are surprised by Albers’ color studies, even the most grizzled veterans of software delivery will make surprisingly incorrect assumptions about their systems: about whether things will work, whether they will continue to work, whether they create value, whether they meet customer needs, and so on.

And it’s no wonder—living in a world of constant conscious doubt is hard. It demands a lot of energy. We have to be unrelenting to resist the urge to fall back on the heuristics we learned from the world of objects.

Conscious doubt creates cognitive strain, and doing so constantly is a heavy burden. This is probably one reason why software development has a high rate of “burnout.” So what do we do?

Systems of Synthetic Management

Let’s restate the problem:

  • Our materials (code, interfaces, requirements, etc.) are derived from unstable primitives. 
  • To use our materials, we need to adopt a mindset of challenging everything. 
  • We can’t trust ourselves to consistently use that mindset. 

The solution? Systems.

We can use systems that manage the uncertainty for us, that create bounded contexts of risk and bring new knowledge to the foreground as it emerges. 

  1. We have the incredible power of programmability. We can invest heavily in the gifts given by the medium of code. We can construct elaborate systems of test automation, continuous integration, and production monitoring to unrelentingly test every one of the assumptions we make about how things are supposed to work.
  2. We have practices of agile, lean, and design-thinking to guide us in managing our semantics. We have developed methods driven by research and statistics to generate better primitives. We can work iteratively within boundaries that limit the scope of inevitable errors. We can use these practices to find metastable states that enable us to move forward. 

We are still maturing these systems of synthetic management, developing the competencies required to manage and control our work’s synthetic nature. It pays to remember that it has been only 20 years since the Agile Manifesto was written, and even less since the word DevOps was coined. Our field is new, and we have not yet mastered these ways of working, though of course we easily slip into cognitive illusions that convince us otherwise. 

All kinds of work involve overcoming illusions. But software has the added burden of resisting the siren’s song of a stable world of dynamic and semantic primitives. Fortunately, we can create systems to escape from it, and the development of those systems will define our success in delivering value with software.

What about Formal Verification?

In the paper A Formally Verified NAT Stack, the authors describe a software-defined network function that is “semantically correct, crash-free, and memory safe.” That’s right: it has no bugs, it will never crash, and they can prove it! But how important is this?

For decades, formal verification has been lurking with a promise of helping us make better systems. Since the beginning of programming, there has been steady work to create languages, tools and systems to do formal verification, which is an analytic method of exhaustively proving the correctness of code, rigorously specifying mathematical models to ensure correct behaviour.

At the forefront of this effort was Sir C.A.R. Hoare, inventor of the quicksort algorithm and the null reference, and a lifelong labourer on formal methods like Hoare logic and CSP.

But formal verification has never really caught on in software, and in 1996, at a talk at a theoretical computer science workshop, Hoare conceded:

Ten years ago, researchers into formal methods (and I was the most mistaken among them) predicted that the programming world would embrace with gratitude every assistance promised by formalisation to solve the problems of reliability that arise when programs get large and more safety-critical.

Programs have now got very large and very critical – well beyond the scale which can be comfortably tackled by formal methods. There have been many problems and failures, but these have nearly always been attributable to inadequate analysis of requirements or inadequate management control. It has turned out that the world just does not suffer significantly from the kind of problem that our research was originally intended to solve.

This raises the question: what is the problem that their research was intended to solve?

In product development there are two sides of “correctness”:

  • Validation – did we build the right thing? Did we build something that people want to use?
  • Verification – did we build it the right way? Have we made the thing we were trying to make?

Build the right thing; build the thing right…

As Ousterhout writes in A Philosophy of Software Design, “the most fundamental problem in computer science is problem decomposition: how to take a complex problem and divide it up into pieces that can be solved independently.” This is the essence of analysis. You would be forgiven for thinking, as many did during Hoare’s time, that analytic approaches would be successful in managing software development.

In terms of validation, we have seen a number of spectacular failures come out of analytic management approaches like Waterfall. We have seen a lot of software that technically works, but doesn’t meet the user’s needs:

(See the infamous “None Pizza with Left Beef” order: built exactly to spec, useful to no one.)

So over the years, we learned to do validation differently.

We created processes that continuously check our work using empirical approaches – methods designed to integrate the testing of assumptions throughout the lifecycle of the product. A few years after Hoare’s speech, this was first articulated as the Agile Manifesto, and we have been building on it ever since. I like to call these approaches Synthetic Management practices (synthetic means “finding truth by recourse to experience”).

What about verification?

The book Software Engineering at Google describes the difference between programming (development) and software engineering (development, modification, maintenance). Software engineering is programming integrated across people and over time. I love this quote that highlights the difference: “It’s programming if ‘clever’ is a compliment, it’s software engineering if ‘clever’ is an accusation.” Being correct in the moment is a lot different than scaling correctness across people and over time, allowing the system to change but still be correct.

To scale correctness, we need systems of verification, and here we turn heavily to the synthetic approaches. We manage our code using unit tests, functional tests, integration tests, build systems, CI/CD, and so on. Over the years, we have replaced manual QA processes with programmatic forms of verification that operate continuously throughout the lifecycle of our work. These are the systems of software engineering we use to enable a codebase to change reliably across many people over a long time.
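A minimal sketch of what “programmatic verification that operates continuously” means in practice (the function and values are made up for illustration): the assumptions live next to the code and get re-checked on every change, not just once at review time.

```python
def normalize_rate(rate: float) -> float:
    """Clamp a rate into the interval [0.0, 1.0]."""
    return max(0.0, min(1.0, rate))

def test_normalize_rate() -> None:
    # Each assertion encodes an assumption we want re-verified forever.
    assert normalize_rate(0.5) == 0.5
    assert normalize_rate(-3.0) == 0.0  # below range clamps up
    assert normalize_rate(7.2) == 1.0   # above range clamps down

test_normalize_rate()  # in a real project, a CI runner (e.g. pytest) invokes this
```

The test is cheap to write and travels with the code, so when the implementation changes across people and over time, the verification changes with it.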

So when it comes to verification techniques, we need to consider not whether a technique works, but whether the technique can be used as part of an organizational system of engineering.

Does formal verification have a place in systems of software engineering?

From the paper:

The proof of correctness of the NAT combines theorem proving for data structures code (stateful) and symbolic execution for the rest of the code (stateless). Even though theorem proving requires a lot of human effort, NF developers can reuse the verified data structures for many NFs. Symbolic execution is automated, and easy to run on a new NF

A Formally Verified NAT Stack, p. 9

There are two key points to make here:

  1. Formal verification is squarely analytic. We would need to decompose the problem upfront, and then use one-off manual processes to create code that is ‘eternally’ correct. This does not comport with the continuous, synthetic style of verification that has become so successful.
  2. To pull off the formal verification, a lot of manual effort goes into proving the data structures, and after the proving, the components become ossified, unchangeable. The code is perfect, so long as it doesn’t need to change. It’s correct until we need to add new features. And then we have to do the whole “theorem proving” part again. That part is very hard.

The idea that we can reuse data structures without changing them is naïve. Write once, run forever? The real world doesn’t work that way. Code needs to breathe, and the air it breathes comes through the winds of change. (ok, maybe that’s a little too much metaphor).

In other words, formal verification does not scale with systems of software engineering. It violates the principle that makes synthetic verification so effective: become part of the system. Formal proof lives outside the code, outside the system – it does not “take advantage” of the medium of code.

What to make of formal verification then?

Systems built using the medium of code need programmable proof that lives and executes with the code. Tests are easy to implement as part of the system, and so they have become the primary method of demonstrating “correctness”. Software is synthetic and therefore lends itself to synthetic proof. 

But as Bentley says in Programming Pearls: formal proofs may not become organizationally useful in software, but they have improved and evolved our understanding of how to write good code. For learning about algorithms and expanding our collective understanding of computer science, formal proofs have value. So yes, there is a place for formal proof in software, just not in the software industry.

Collaboration Calculus

Two of my favourite books show us the yin and yang of success – the individual and the team.


Dan Pink’s Drive describes three key areas of motivation and personal growth:

  • Autonomy: The desire to be self-directed
  • Mastery: The urge to get better skills
  • Purpose: The need to work with meaning

Meanwhile, in David Marquet’s Turn the Ship Around we find a set of values for empowered teams:

  • Autonomy: Give people control over their decisions
  • Alignment: Provide the information people need to make the best decisions
  • Transparency: Ensure people communicate their decisions to everyone around them

Putting these together, you get the calculus of collaboration – two sides of the equation for successful teamwork:

On both sides of the equation, you have autonomy: the essential element for both individuals and organizations. Empowered people are at the core of empowered teams.

Autonomy is the ability of individuals to work independently, without micromanagement.

But of course, autonomy unhinged will only result in chaos. If everyone goes off and does their own thing, work will be confusing and unproductive. “Divide the fire, and you will sooner put it out.”

So autonomy, the key to success, must be modified on both sides of the equation:

  • At the personal level, the effectiveness of autonomy is amplified by mastery and purpose. The more we develop our mastery and work with purpose, the greater we are able to make use of our autonomy.
  • At the team level, autonomy is tempered by the group’s need for alignment and transparency. The more aligned the team is and the more transparent team members are with one another, the better we are able to apply our autonomy to collective problem solving.

As individuals, when we achieve mastery and purpose, we make the most of our autonomy.

As teammates, when we practice transparency and alignment, we give the best of our autonomy.

Putting Pink and Marquet together, you get the collaboration equation at the centre of dynamic learning organizations. Solve for a: solve for autonomy.

The Seven Forms of “Not Waste”

Lean manufacturing has made a really big deal about waste. There’s muda, mura and muri; You’ve got your Type I and Type II; And of course 7, 8 or maybe 9 forms of it, depending on who you ask. The Inuit words for snow have nothing on the Japanese words for waste.

With such an obsession with waste, you might think Taiichi Ohno had some kind of childhood trauma associated with it. After reading a few books on Lean, I’m looking for waste everywhere, and I’m starting to think that some kind of trauma has been inflicted on me!

By casting out waste in all its forms, they say, we can create a smooth flow of work across our production system. Remove all variance, they say, and only then can a system truly create value.

In highly deterministic systems, this thinking wins arguments. And we desperately want the systems that deliver software (organizational and technical) to be highly deterministic. But our systems are resistant to analysis that grants determinism – they are synthetic. We embrace the uncertainty, adjust for instability, and renounce absolutism.

So instead of trying to come up with deterministic categories of things that ARE waste, maybe it makes sense to think more about things that are NOT waste. Things like,

  1. Running experiments: It’s really hard to run experiments in an active manufacturing line – that would result in a lot of waste. But fortunately for you, code is (relatively) free. We can start all our work with a hypothesis and validate it. Got it wrong? Try again. The reality is that every increment of value begins with experiments (whether you know it or not). But people get confused by this – they think that spending the time to see something running in the system is not value-adding … instead of 2 weeks to see what it looks like, we should spend 4 weeks writing specs? This might seem crazy (because it is), yet people still do it. 
  2. Writing code: Sometimes the hardest thing to do is get started. Procrastination. Trepidation. But fear not! Just write the code! Try it out. See what happens. Run the experiments. In software, arguments don’t win arguments, working code wins arguments. Sometimes we need a gentle reminder that software is code, not meetings and ppts and spreadsheets. This is your reminder.
  3. Showing your work: It’s never too early to share your work with others. You don’t need to polish it. You don’t need to clean it up. You don’t need to worry if it looks good yet. Make your work visible, as early as possible. Does showing your work scare you? Bring it up with your manager. Leaders should strive to make your engineering environment ego-free and safe to be wrong. The alternative is to rat-hole for days and turn up later with 500 lines of code in a single pull request – that’s waste. Collaborating with a teammate to get some feedback on your work? That’s value. Get that value.
  4. Estimates: Just kidding! Estimates are waste. Stop doing them. “Velocity” is the biggest trick ever played on Agile. The only valid argument that I know of for having a formal practice of estimating every piece of work is to find out if people understand the work differently. But the estimate is still wrong, and the practice is highly vulnerable to corruption and malice and all kinds of evil. If you make a practice of it, it won’t be long before your estimates are weaponized and someone starts bin packing your milestones. #NoEstimates
  5. Retrospectives: You always have time for a retro. The only valid excuse that I know of for not doing the retro is “I forgot”. I will actually accept that answer. Should we really get together for a whole hour and talk about the past? If you want the future to go well, then yes, you should. To find your team’s cadence, experiment and adjust as necessary. 
  6. Demos: It’s not done-done until you do the demo-demo. Demos have so much value for the team, but it’s easy to make excuses and skip them to “give people their time back”. What do you think they will be doing with that hour on Friday afternoon that couldn’t be done half-zoned out at the demo? (Yeah, I said it…) Just do the demo and let people see what you are working on (see #3). It’s also a good opportunity for people to build their presentation and leadership skills, which is also value. 
  7. Documentation: This is how code scales. I know it’s true because I read about it in this book from Google. You believe Google, don’t you? Software is a Game of Intents. Do not allow intents to get obfuscated. Do not force the reader to dig for intents. Take special care to document anywhere intents are non-obvious. Leave evidence of intended behavior, especially when it is surprising.
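As a tiny illustration of leaving evidence of intent (the scenario and the numbers are invented), the comment below records the why behind a surprising constant, not just the what:

```python
def retry_delays(attempts: int) -> list:
    # Intent: exponential backoff, but capped at 30s because our
    # (hypothetical) upstream gateway drops connections that are idle
    # longer than that. Without this comment, the magic number looks
    # arbitrary and invites "cleanup" by a future reader.
    return [min(2.0 ** n, 30.0) for n in range(attempts)]
```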

So now that we have a few ideas about what waste is NOT, does this help us understand what waste IS?

Well, even what Lean might call “defects” might not be worth calling waste in software. As my friend Lucas Siba likes to say, “Mistakes are just high speed learning opportunities.” An excellent reminder comes from this book:

“It is tempting but incorrect to label anything that goes wrong on a project as waste. Human beings make mistakes. A developer may accidentally push code before running the test suite. Our knowledge is limited. A product manager may write an impractical user story because he or she does not know of some particular limitation. We forget. A developer might forget that adding a new type to the system necessitates modifying a configuration file. Whether we conceptualize these sorts of errors as waste is a matter of opinion, but focusing on them is unhelpful because they are often unpredictable.”

Rethinking Productivity in Software Engineering, p. 225

It might have to do with the non-linear structures of synthetic systems. It might have to do with the difficulty of expressing creative work in the form of an equation. (We are not machines!) It might be that models of complex systems are difficult to use for predicting the future. (“Past performance does not predict future results,” and so on…) No one has come close to defining an ergodic system for measuring software development, and so it’s hard to put labels like “waste” or even “value” on things.

Remember, once upon a time Yahoo banned remote work altogether because … because waste. (What are they doing now, by the way, with all the Covid?) Meanwhile, there are many highly successful organizations, like Sonatype, that are 100% remote. In other words, what looks like waste to some is value to others.

So my recommendation is this: don’t focus on trying to eliminate waste, try to find the good in everything. Capitalize on the hidden value, wherever it is. And have fun doing it.