Guðni Ólafsson

The great testing pyramid of Devops.

Are you using the test pyramid correctly? This simple concept needs careful consideration before you can actually get value from applying it. Let's take a look at The great testing pyramid of <del>Cheops</del> Devops.



What is the test pyramid?


The test pyramid is surprisingly like the actual pyramids. It looks very simple on the outside, but it is confusing and difficult to navigate on the inside, though at least it is not actually full of deadly traps.


At its simplest it is just a heuristic: one for choosing where to expend effort in test automation projects. It has been floating around agile circles since 2009, and it states that the longer it takes to run tests, the fewer of those kinds of tests there should be.


To me this is the clearest description of the test pyramid; a more concrete one is available on the same site. Mike Cohn introduced the pyramid in "Succeeding with Agile", though I suspect more people have heard of it from Google's testing blog.


There are a lot of posts heralding it as the new(ish) silver bullet for agile development.

But there are also a lot of people who are less impressed. I happen to fall into that group. In this post I aim to explain to you why.


What is the problem with it?


The first problem is, of course: "Unit tests ARE NOT."


They are not valuable tests from a QA perspective, since they don't really find many bugs at all, and practically none of the most common kind: integration bugs. They need a lot of maintenance in the early phases of projects, and are simply not suitable for some kinds of code. Now please don't lynch me yet. Keep in mind that what I think of when I see the word unit-test is not the same as what you think of. Not exactly.
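
To illustrate with a minimal sketch (hypothetical names, Python/pytest): two functions can each pass their own unit tests while the bug sits squarely between them.

```python
# Hypothetical sketch: both sides pass their unit tests, yet their
# integration is broken, because each unit test bakes in its own assumption.

def make_discount(percent: int) -> dict:
    """Producer: emits the discount as a fraction (20 -> 0.2)."""
    return {"discount": percent / 100}

def apply_discount(price: float, payload: dict) -> float:
    """Consumer: wrongly assumes 'discount' is a whole percentage."""
    return price * (1 - payload["discount"] / 100)

def test_make_discount():          # unit test: passes
    assert make_discount(20) == {"discount": 0.2}

def test_apply_discount():         # unit test: passes
    # The hand-rolled payload encodes the consumer's wrong assumption.
    assert apply_discount(100.0, {"discount": 20}) == 80.0

def test_discount_wired_together():  # only this one catches the mismatch
    assert apply_discount(100.0, make_discount(20)) == 80.0  # actual: 99.8
```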


The second, more obvious, problem is that most people are not talking about the same thing. We have wildly different definitions of unit tests and integration tests; sometimes it seems that practically no one is using the same ones. In fairness, making such definitions is a non-trivial problem. But the lack of consensus is the root of the problem, and the main reason why people still argue about so many best/better/good practices.


The third problem is, of course, that different teams have wildly different needs, depending on size, project type and code base. A Google best practice is almost certainly an antipattern in a smaller shop. Google is optimizing for problems that the small organisation will never face, and doesn't have to worry about many problems that smaller ones will.


Can we fix the problems?


People shouting at each other online because of different definitions is par for the course. Of course. It was upsetting when I realized that I had done exactly that with the test pyramid. I never looked past my own definitions of unit and integration tests, and condemned the whole thing as silly: a new silver bullet for agile consultants and evangelists to shoot their teammates with. It turns out that with suitable definitions of units and integrations, the pyramid can make sense.


Now, those definitions of unit tests and integration tests are, to me, wildly inappropriate. But for the purpose of making sense of the benefits of the pyramid, they are dead on:



  • Unit = code-only tests. Possibly testing multiple genuine classes interacting. Only testing non-trivial code (no getters, setters or other trivial functions).

  • Integration = System-level integrations: External dependencies and running services needed by the app.

  • E2E = UI-driven testing. Click, type and get-text your way to glory.


My internal model of what those should be is very different.

  • Unit = One class (or one file, or one function). Test the contents in absolute isolation. To verify and document correct interface behavior.

  • Integration = ALL THE OTHER CODE-BASED TESTS. If you don't mock all the things you now have an integration test.

  • E2E = All the tests that require actual running software to be able to run.

With such wildly different schemas, the different views are not surprising. One man's Golden Pyramid is another's Triangular Pile of Refuse.
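
To make the gap concrete, here is a rough sketch (Python/pytest, hypothetical classes). Under the pyramid-friendly definitions above, both tests count as unit tests; under my definitions, only the first one does.

```python
from unittest.mock import Mock

class FlatTaxService:
    def tax_for(self, net: float) -> float:
        return net * 0.24

class PriceCalculator:
    def __init__(self, tax_service):
        self.tax_service = tax_service

    def total(self, net: float) -> float:
        return net + self.tax_service.tax_for(net)

def test_price_calculator_in_isolation():
    # Unit test by my definition: the collaborator is mocked away entirely.
    tax = Mock()
    tax.tax_for.return_value = 2.4
    assert PriceCalculator(tax).total(10.0) == 12.4

def test_price_calculator_with_real_tax_service():
    # No mocks, two real classes interacting: by my definition this is
    # already an integration test, even though it is still code-only.
    assert PriceCalculator(FlatTaxService()).total(10.0) == 12.4
```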


Words matter. I've had many heated discussions with people about the different types of tests. Discussions where the bottom line was that they had different definitions. Different from me, and/or different from their own sources. Before we adopt some amazing new practice we'd do well to make sure we're talking about the same things. Same thing goes for condemning our teammates for idiocy. 😊 Although... at QPR Software we're allowed to be a bit idiotic, every now and then. We deeply value learning and learning always starts from ignorance.


It's worth noting that I'm considering API tests to be a kind of e2e test (for servers, the UI is just the API), and that I'm intentionally not considering many other kinds of tests: database tests, 3rd-party integration tests, performance tests or load tests. They really don't fit into the pyramid; we consider them in totally different terms.
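
For what I mean by API tests as e2e, here's a minimal sketch (the endpoint, port and payloads are made up, assuming a locally running service and Python's requests library):

```python
import requests

BASE_URL = "http://localhost:8080"  # assumed: a locally running test deployment

def test_create_and_fetch_item():
    # e2e by my definition: it needs actual running software, nothing mocked.
    created = requests.post(f"{BASE_URL}/items", json={"name": "widget"})
    assert created.status_code == 201

    item_id = created.json()["id"]
    fetched = requests.get(f"{BASE_URL}/items/{item_id}")
    assert fetched.status_code == 200
    assert fetched.json()["name"] == "widget"
```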


How do we fix them?


What does it even mean to fix those problems? If we can agree on what the words mean, how do we figure out which proportions of (automated) test levels are right for us?


As software engineers we have a habit of abstracting early and abstracting a lot. We (over)generalize a lot. DRY has been hammered into most of us. To the degree that copy-pasting code around our codebase feels very... dirty.


How about we... stop. How about we work with specifics. How about we Don't Abstract Yet, and instead wait until we know enough to do so fruitfully?


This is hardly revolutionary. But how about we do the same in our discourse and exchange of ideas, online and in the workplace? How about we focus on the aspects of our projects that make given practices beneficial, and choose the practices that give our projects the most value?


Let's arm ourselves with good heuristics about practices, and about the context we need to get value out of them. Only then can we make good decisions about what to automate and what not to.


Here are some examples of heuristics I find useful when <del>coming up with a test strategy</del> deciding what to test and what to automate.


  1. How much of the code is "glue"? The higher the glue content, the lower the value of unit testing.

  2. Do we have strong, static typing? That eliminates a lot of the tests we'd need for a dynamically typed language (see the sketch after this list).

  3. What sort of UI do we have? Obviously we won't implement UI e2e tests for a server-side API. Also, web-based UIs (and console-based ones) are much easier to test than native applications.

  4. How much control do we have over the (test) data and dependencies? If we can't reliably set up our test cases, the ones we write will be flaky.

  5. Are we working on an application/feature in the product-market-fit stage? There's no point in automating tests for features that will be dropped next week.
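
As a sketch of heuristic 2 (hypothetical function, Python/pytest): in a dynamically typed language we end up writing guard tests that a compiler would make redundant.

```python
import pytest

def apply_discount(price, percent):
    # Nothing stops a caller passing strings here, so we guard at runtime...
    if not isinstance(price, (int, float)) or not isinstance(percent, (int, float)):
        raise TypeError("price and percent must be numbers")
    return price * (1 - percent / 100)

def test_rejects_string_arguments():
    # ...and then we test the guard. In a statically typed language this
    # whole test disappears: apply_discount("100", "20") would not compile.
    with pytest.raises(TypeError):
        apply_discount("100", "20")
```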


There are of course many more. Find the ones that apply to your own project.


Finally


The right way to apply the test pyramid is to understand what the actual levels of tests are, and what your project needs from unit tests, integration tests and e2e tests. Not all projects lend themselves well to unit tests. Not all projects can support traditional UI-based e2e tests. Projects have different needs with regard to quality and speed of execution. They have different resourcing.

  1. Start by defining the actual words. What's a unit test in your context? What's an integration test or an e2e test?

  2. What ways do we have to group these tests together? When and how will we execute them? (A sketch of one approach follows this list.)

  3. Throw away the pyramid and find the right shape for your project.

  4. Find the heuristics that fit for your project, and figure out how to apply them.
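
For step 2, one possible mechanism (a pytest sketch; the marker names are made up) is to tag tests by level and let each pipeline stage pick the levels it can afford:

```python
import pytest

@pytest.mark.unit
def test_discount_math():
    assert 100 * (1 - 20 / 100) == 80

@pytest.mark.integration
def test_repository_against_local_db():
    ...  # would need a real database to run

# Register the markers once in pytest.ini:
#   [pytest]
#   markers =
#       unit: fast, isolated, code-only tests
#       integration: tests needing external services
#
# Then choose what runs where:
#   pytest -m unit           # on every commit
#   pytest -m integration    # on merge
#   pytest                   # nightly: everything
```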

What is it called when we do that? Test strategy and test planning, with a pinch of test design... Who would have thought those three activities were too complicated to be reasonably depicted by a three-layer pyramid?

