When green tests mean nothing (Chicago vs London unit testing)

A few months ago I put together a slide deck for an internal hackathon day at Xero, then gave an expanded version as an unconference session. The title was Mock Theatre: when green tests go rouge, a deliberately theatrical way of saying that a wall of green ticks is not the same thing as confidence.

I only found out I was speaking three days before the session, so I wrote the whole deck in markdown and published it as a static website. It worked brilliantly: arrow keys to navigate, deep links to individual slides, no PowerPoint wrestling. The full deck is here:

Mock Theatre: when green tests mean nothing. Use arrow keys to move through all 49 slides, or jump straight to a topic via the URLs below.

Why green tests can lie

The opening question in the deck is simple: why can green tests lie? Tests pass, CI is green, and yet production still breaks... or worse, the team ships the wrong behaviour with a clean conscience.

A common culprit is the over-specified interaction test: the test asserts that collaborator X was called with message Y, and that the calls happened in one exact order, even when the business only cares that the user received the right outcome. Change the wording of an email template and the test fails, even though the feature still works.
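To make the smell concrete, here is a minimal Python sketch using the standard library's unittest.mock. It is not taken from the deck: send_welcome, mailer and audit are hypothetical names invented for illustration. The first test pins the exact wording and call order; the second checks only that one email went to the right person.

```python
from unittest.mock import Mock, call


def send_welcome(user_email, mailer, audit):
    """Hypothetical feature: email a new user, then note it in an audit log."""
    mailer.send(user_email, "Welcome aboard!")
    audit.record("welcome_sent", user_email)


def test_over_specified():
    # A single parent mock records calls to both collaborators, in order.
    parent = Mock()
    send_welcome("sam@example.com", parent.mailer, parent.audit)
    # Brittle: pins the exact wording and the exact call sequence.
    assert parent.mock_calls == [
        call.mailer.send("sam@example.com", "Welcome aboard!"),
        call.audit.record("welcome_sent", "sam@example.com"),
    ]


def test_outcome_focused():
    mailer, audit = Mock(), Mock()
    send_welcome("sam@example.com", mailer, audit)
    # Checks only what the business cares about: one email, right recipient.
    mailer.send.assert_called_once()
    recipient = mailer.send.call_args.args[0]
    assert recipient == "sam@example.com"
```

Reword the greeting or swap the two calls and the first test goes red while the feature still works; the second test keeps passing.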

Other smells and failure modes from the deck:

Interaction-only tests can pass while outcomes are still broken
Over-mocking hides integration mistakes and configuration errors
Mocking everything makes failures noisy and brittle
Tests locked to private sequencing rather than observable results
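The first item on that list can be sketched in a few lines of Python. Everything here is an assumption made up for illustration (apply_refund, InMemoryAccounts, the deliberate bug); it is not code from the deck. The interaction-only check is satisfied as soon as save is called, while the outcome check notices that the balance went the wrong way.

```python
from unittest.mock import Mock


class InMemoryAccounts:
    """Hypothetical fake repository that keeps balances in a dict."""

    def __init__(self, balances):
        self.balances = dict(balances)

    def get(self, account_id):
        return self.balances[account_id]

    def save(self, account_id, balance):
        self.balances[account_id] = balance


def apply_refund(account_id, amount, accounts):
    # Deliberate bug for the demo: a refund should add, not subtract.
    balance = accounts.get(account_id)
    accounts.save(account_id, balance - amount)


def interaction_only_check():
    accounts = Mock()
    accounts.get.return_value = 100
    apply_refund("acc-1", 30, accounts)
    accounts.save.assert_called_once()  # passes: save was called, with anything
    return True


def outcome_check():
    accounts = InMemoryAccounts({"acc-1": 100})
    apply_refund("acc-1", 30, accounts)
    return accounts.get("acc-1") == 130  # the bug leaves 70, so this is False
```

Both checks run the same buggy code; only the one that looks at resulting state fails.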

I called this Mock Theatre because the test suite performs a convincing play... every mock gets its line, every expectation is met... but nobody checked whether the story made sense for the customer.

If this sounds familiar, you may also enjoy my older post, Are unit tests worth all that dough?, which covers the same theme from a startup delivery angle.

Chicago vs London: two schools, two questions

The deck's second thread is Chicago vs London style testing. I had honestly never heard the cities used as shorthand until a colleague mentioned it the week before the talk... but the ideas are old, and the terminology map in the slides makes the aliases clear:

Label	Also known as	What you verify
Chicago	classical, classicist, sociable	State: outputs, return values, persisted data
London	mockist, solitary	Behaviour: calls, messages, collaboration protocol

The names are historical, not prescriptive. What matters is the question each style asks. Slide 15 puts it neatly as two styles, two questions:

Chicago (state-based): After I run this, what state or result exists?

London (interaction-based): Did I collaborate with dependencies the right way?

The deck walks through the same invoice-service example both ways. The Chicago-style (state-based) version checks that totals and repository state are correct, while the London-style (interaction-based) version verifies that the right collaborators were invoked. Seeing them side by side on slide 20 is the fastest way to internalise the difference.
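For readers without the slides open, here is a rough Python approximation of that comparison. It is a sketch under assumptions, not the deck's own code: InvoiceService, InMemoryInvoiceRepository, NullNotifier and their methods are hypothetical names.

```python
from unittest.mock import Mock


class InMemoryInvoiceRepository:
    """Hypothetical in-memory repository used by the state-based test."""

    def __init__(self):
        self.saved = {}

    def save(self, invoice):
        self.saved[invoice["id"]] = invoice


class NullNotifier:
    """Hypothetical do-nothing collaborator."""

    def invoice_created(self, invoice_id):
        pass


class InvoiceService:
    def __init__(self, repository, notifier):
        self.repository = repository
        self.notifier = notifier

    def create_invoice(self, invoice_id, line_amounts):
        invoice = {"id": invoice_id, "total": sum(line_amounts)}
        self.repository.save(invoice)
        self.notifier.invoice_created(invoice_id)
        return invoice


def test_chicago_style():
    # State-based: after I run this, what state or result exists?
    repo = InMemoryInvoiceRepository()
    service = InvoiceService(repo, NullNotifier())
    invoice = service.create_invoice("INV-1", [100, 50])
    assert invoice["total"] == 150
    assert repo.saved["INV-1"]["total"] == 150


def test_london_style():
    # Interaction-based: did I collaborate with dependencies the right way?
    repo, notifier = Mock(), Mock()
    service = InvoiceService(repo, notifier)
    service.create_invoice("INV-1", [100, 50])
    repo.save.assert_called_once_with({"id": "INV-1", "total": 150})
    notifier.invoice_created.assert_called_once_with("INV-1")
```

Both tests exercise the same create_invoice call; they differ only in what they look at afterwards.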

Martin Fowler's Mocks Aren't Stubs is the canonical reference here; the deck quotes it heavily from slide 27 onwards.

When to use which (and when to mix)

Neither school wins everything. The deck argues for deliberate trade-offs:

When Chicago is a better default: the code is core domain logic with deterministic outputs, you want refactoring freedom, and you can verify correctness through returned values or state without standing up the whole world.

When London is a better default: side effects matter more than return values, external systems and queues are involved, or outside-in TDD needs a precise collaboration contract.

A hybrid strategy that works in practice:

Use Chicago style for domain and pure computation
Use London style at boundaries with side effects
Keep mocks at the edge, not everywhere
Prefer fakes over mocks when the protocol is simple
Mix both in the same service only when the test goal is explicit

Slide 26 makes the refactoring trade-off concrete: swap FixedDiscountPolicy for TieredDiscountPolicy and Chicago tests stay green if invoice totals stay correct; London tests may break if the call graph changes. Both outcomes are acceptable, if you chose deliberately.
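A small Python sketch of that trade-off follows. The two policy class names come from the slide, but their bodies, the rates, and the two invoice_total functions are assumptions made for illustration. The assumed refactor is that the calculator stops asking the policy for a discount amount and starts asking it for a rate, which changes the call graph without changing the total.

```python
from unittest.mock import Mock


class FixedDiscountPolicy:
    """Assumed behaviour: a flat 10% off."""

    def discount_for(self, subtotal):
        return subtotal * 10 // 100


class TieredDiscountPolicy:
    """Assumed behaviour: 10% below 1000, 15% from 1000 upwards."""

    def rate_for(self, subtotal):
        return 15 if subtotal >= 1000 else 10


def invoice_total_v1(lines, policy):
    subtotal = sum(lines)
    return subtotal - policy.discount_for(subtotal)


def invoice_total_v2(lines, policy):
    # After the swap: same result, different conversation with the policy.
    subtotal = sum(lines)
    return subtotal - subtotal * policy.rate_for(subtotal) // 100


def chicago_check(total_fn, policy):
    # Only the resulting total matters.
    return total_fn([150, 50], policy) == 180


def london_check(total_fn):
    # The mock only knows the original protocol (discount_for).
    policy = Mock(spec=FixedDiscountPolicy)
    policy.discount_for.return_value = 20
    try:
        total = total_fn([150, 50], policy)
        policy.discount_for.assert_called_once_with(200)
    except (AssertionError, AttributeError):
        return False
    return total == 180
```

Here chicago_check holds for both versions with their real policies, while london_check holds for invoice_total_v1 and fails for invoice_total_v2. Whether that failure is noise or a useful alarm depends on whether the collaboration contract was something you meant to pin down.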

Mocks, stubs, fakes, and why vocabulary matters

A large chunk of the deck is vocabulary hygiene, because teams argue past each other when words mean different things. The Mocks Aren't Stubs primer distinguishes:

Mock: verifies behaviour (did the right interaction happen?)
Stub: provides canned responses (no verification)
Fake: a working lightweight implementation (e.g. an in-memory repository with real logic)

Of the three, only mocks insist on behaviour verification: their expectations are typically declared during setup and checked after the code runs. What your team agrees on is what counts... but put it in writing.
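One way to see the three terms side by side is a single Python test that uses each kind of double once. All names here (StubRateProvider, FakeLedger, record_payment, the notifier) are hypothetical. Note that Python's unittest.mock verifies calls after the fact rather than declaring expectations up front, so treat it as an approximation of the mock idea rather than a textbook case.

```python
from unittest.mock import Mock


class StubRateProvider:
    """Stub: returns a canned answer and verifies nothing."""

    def rate(self, currency):
        return 2


class FakeLedger:
    """Fake: a working in-memory implementation with simple real logic."""

    def __init__(self):
        self.entries = []

    def post(self, amount):
        if amount <= 0:
            raise ValueError("amount must be positive")
        self.entries.append(amount)

    def balance(self):
        return sum(self.entries)


def record_payment(amount, currency, rates, ledger, notifier):
    converted = amount * rates.rate(currency)
    ledger.post(converted)
    notifier.payment_recorded(converted)


def demo():
    ledger = FakeLedger()
    notifier = Mock()  # Mock: the test verifies the interaction itself
    record_payment(50, "NZD", StubRateProvider(), ledger, notifier)
    assert ledger.balance() == 100  # state check, via the fake
    notifier.payment_recorded.assert_called_once_with(100)  # behaviour check
    return ledger.balance()
```

The stub is never asserted on, the fake is asserted on through its state, and only the mock is asserted on through the calls it received.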

Beyond the pyramid

The later slides broaden the lens: test pyramids vs testing diamonds, where executable specifications live, approval testing with Verify, and keeping docs in sync with compiled snippets. Worth a skim if you are building financial or compliance-heavy systems where "it compiles and the unit tests pass" is nowhere near enough.

Takeaways

The closing slides (43 and 44) boil down to three ideas:

Chicago and London answer different questions about behaviour
Mocks are a technique, not a goal... they change what you verify
Test style shapes design seams and refactoring cost, so be explicit about intent

For deeper reading, the deck's references section points at Kent Beck's TDD book, Freeman & Pryce's Growing Object-Oriented Software, Guided by Tests, J.B. Rainsberger on integrated tests, and Thoughtworks' piece on mockists vs classicists.

Try the deck yourself

If you are leading a brown-bag, guild session, or unconference slot, the full Mock Theatre slide deck is designed to be walked through interactively rather than delivered as a lecture. Start at slide 1, skip to Chicago vs London if you are short on time, or jump to the tools and further reading at the end.

#unittesting #tdd #mocks #testing #xero

(This post expands on speaker notes I shared on LinkedIn for the Xero unconference session.)

Disclaimer: These views and opinions are those of the author and do not constitute professional advice. Neither Alan Hemmings nor Goblinfactory Ltd (if mentioned) shall be liable for any reliance on this content.