Sunday, November 09, 2025

Mutation Testing: When "Good Enough" Tests Weren't

For weeks, I had been carrying this nagging doubt. The kind of doubt that's easy to ignore when everything is working. My inventory application had 93% test coverage, all tests green, type checking passing. The code had been built with TDD from day one, using AI-assisted development with Claude and Cursor (with Sonnet 4.5, GPT-4o, and Claude Composer), what I like to call "vibecoding". Everything looked solid.

It's not a big application. About 650 lines of production code. 203 tests. A small internal tool for tracking teams and employees. The kind of project where you might think "good enough" is actually good enough.

But something was bothering me.

I had heard about mutation testing years ago. I even tried it once or twice. But let's be honest: it always felt like overkill. The setup was annoying, the output was overwhelming, and the juice rarely seemed worth the squeeze. You had to be really committed to quality (or really paranoid) to go through with it.

This time, though, with AI doing the heavy lifting, I decided to give it another shot.

The First Run: 726 Mutants

I added mutmut to the project and configured it with AI's help. Literally minutes of work. Then I ran it:

$ make test-mutation
Running mutation testing
726/726  πŸŽ‰ 711  ⏰ 0  πŸ€” 0  πŸ™ 0  πŸ”‡ 15  πŸ”΄ 0
33.50 mutations/second

Not bad. 711 mutants killed out of 726. That's a 97.9% mutation score. I felt pretty good about it.

Until I looked at those 15 survivors.

The 15 Survivors

I ran the summary command to see what had survived:

$ make test-mutation-summary
Total mutants checked: 15
Killed (tests caught them): 0
Survived (gaps in coverage): 15

=== Files with most coverage gaps ===
    5 inventory.services.role_config_service
    4 inventory.services.orgportal_sync_service
    2 inventory.infrastructure.repositories.initiative
    1 main.x create_application__mutmut_6: survived
    1 inventory.services.orgportal_sync_service.x poll_for_updates__mutmut_6: survived
    1 inventory.db.gateway
    1 inventory.app_setup.x include_application_routes__mutmut_33: survived

There they were. Fifteen little gaps in my test coverage. Fifteen cases where my tests weren't as good as I thought.

And remember: this is a 650-line application with 203 tests. If I found 15 significant gaps here, what would I find in a 10,000-line system? Or 100,000?

The thing is, a few months ago, this would have been the end of the story. I would have looked at those 15 surviving mutants, felt slightly guilty, and moved on. The effort to manually analyze each mutation, understand what it meant, and write the specific tests to kill it would have taken days. Maybe a week.

Not worth it for a small internal tool.

But this time was different.

What the Mutants Revealed

Before jumping into fixes, I wanted to understand what these surviving mutants were actually telling me. With AI's help, I analyzed them systematically.

Here's what we found:

In role_config_service (5 survivors):
The service loaded YAML configuration for styling team roles. My tests verified that the service loaded the config and returned the right structure. But they never checked what happened when:

  • The YAML file was missing
  • The YAML was malformed
  • Required fields were absent

The code had error handling for all these cases. My tests didn't verify any of it.
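To make the gap concrete, here's roughly the kind of test that kills those mutants. This is a sketch: the class and method names (RoleConfigService, load, DEFAULT_CONFIG) are my assumptions for illustration, not the real API. The point is that each error branch gets its own explicit test.

import pytest
import yaml

# Module path comes from the mutation report; the class and method names are assumed.
from inventory.services.role_config_service import RoleConfigService


def test_missing_config_file_falls_back_to_defaults(tmp_path):
    # Point the service at a file that does not exist.
    service = RoleConfigService(config_path=tmp_path / "roles.yaml")
    assert service.load() == RoleConfigService.DEFAULT_CONFIG


def test_malformed_yaml_is_reported(tmp_path):
    bad_file = tmp_path / "roles.yaml"
    bad_file.write_text("roles: [unclosed")  # invalid YAML
    service = RoleConfigService(config_path=bad_file)
    with pytest.raises(yaml.YAMLError):  # or the service's own error type
        service.load()


def test_missing_required_field_is_rejected(tmp_path):
    incomplete = tmp_path / "roles.yaml"
    incomplete.write_text("roles:\n  developer: {}\n")  # no styling fields
    service = RoleConfigService(config_path=incomplete)
    with pytest.raises(ValueError):
        service.load()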

In orgportal_sync_service (4 survivors):
This service synced data from S3. Tests covered the happy path: download the file, process it, done. But mutants survived when the mutations:

  • Changed log messages (I wasn't verifying logs)
  • Skipped metadata checks (last_modified, content_length)
  • Removed directory existence checks

The code was defensive. My tests assumed everything would go right.
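For illustration, here's a sketch of the kind of test that exercises those branches. Only poll_for_updates comes from the mutation report; the OrgPortalSyncService constructor, the gateway interface, and the attribute names are assumptions of mine.

from unittest.mock import MagicMock

# poll_for_updates appears in the mutation report; everything else here is assumed.
from inventory.services.orgportal_sync_service import OrgPortalSyncService


def test_poll_skips_download_when_metadata_unchanged(tmp_path):
    gateway = MagicMock()
    metadata = {"last_modified": "2025-11-01T00:00:00Z", "content_length": 1234}
    gateway.head_object.return_value = metadata

    service = OrgPortalSyncService(gateway=gateway, data_dir=tmp_path)
    service.last_seen_metadata = metadata  # pretend we already processed this version

    service.poll_for_updates()

    gateway.download_file.assert_not_called()


def test_poll_handles_missing_data_dir(tmp_path):
    gateway = MagicMock()
    missing_dir = tmp_path / "incoming"

    service = OrgPortalSyncService(gateway=gateway, data_dir=missing_dir)
    service.poll_for_updates()

    assert missing_dir.exists()  # or whatever the contract is when the directory is missing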

In database and infrastructure layers (6 survivors):
Similar story. Error paths that existed in production but were never exercised in tests:

  • SQLite connection failures
  • Invalid data in from_db_row factories
  • 404 responses in API endpoints

Classic case of "it works, so I'm not testing the error cases."
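Here's the shape of the tests that close these, again as a sketch. The inventory.db.gateway module and from_db_row come from the report; the DatabaseGateway and Employee names, and the idea that the gateway calls sqlite3.connect directly, are my assumptions.

import sqlite3
from unittest.mock import patch

import pytest

from inventory.db.gateway import DatabaseGateway  # module path from the report; class name assumed
from inventory.domain.employee import Employee    # hypothetical module and class


def test_gateway_surfaces_connection_errors(tmp_path):
    gateway = DatabaseGateway(db_path=tmp_path / "inventory.db")
    # Inject a failure where the gateway opens its connection (assumes it uses sqlite3.connect).
    with patch("inventory.db.gateway.sqlite3.connect", side_effect=sqlite3.OperationalError("db locked")):
        with pytest.raises(sqlite3.OperationalError):
            gateway.fetch_all("SELECT 1")


def test_from_db_row_rejects_invalid_data():
    # A row with missing or empty required columns should not produce a half-built entity.
    with pytest.raises(ValueError):
        Employee.from_db_row({"id": None, "name": ""})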

The pattern was clear: I had good coverage of normal flows, but my tests were optimistic. They assumed the happy path and left the defensive code untested.

This is what deferred quality looks like at the micro level. Like Deming's red bead experiment (where defects came from the system, not the workers), these weren't random failures. They were systematic gaps in how I verified the system. Every surviving mutant is a potential bug waiting in production, interrupting flow when it surfaces weeks later. The resource efficiency trap: "we already have 93% coverage" feels cheaper than spending 2-3 hours... until you spend days debugging a production issue that a proper test would have caught.

The AI-Powered Cleanup

But this time I had AI. So I did something different.

I asked Claude to analyze the surviving mutants one by one, understand what edge cases they represented, and create or modify tests to cover them. I just provided some guidance on priorities and made sure the new tests followed the existing style.

(The app itself had been built using a mix of tools: Claude for planning and architecture, Cursor with different models for implementation. But for this systematic mutation analysis, Claude's reasoning capabilities were particularly useful.)

In about two or three hours, we had addressed all the key gaps:

  • SQLite error handling: I thought I was testing error paths, but I was only testing the happy path. Added proper error injection tests.
  • Factory method validation: My from_db_row factories had validation that was never triggered in tests. Added tests with invalid data.
  • Edge cases in services: Empty results, missing metadata, nonexistent directories. All cases my code handled but my tests never verified.
  • 404 handling in APIs: The code worked, but no test actually verified the 404 response.
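As an example of that last one, the 404 test is tiny. I'm assuming a FastAPI-style app here: the create_application factory in main comes from the mutation report, but the route itself is hypothetical.

from fastapi.testclient import TestClient

from main import create_application  # factory name from the mutation report


def test_unknown_team_returns_404():
    client = TestClient(create_application())
    response = client.get("/teams/does-not-exist")
    assert response.status_code == 404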

The result after several iterations:

$ make test-mutation
Running mutation testing
726/726  πŸŽ‰ 724  ⏰ 0  πŸ€” 0  πŸ™ 2  πŸ”‡ 0
30.02 mutations/second
$ make test-mutation-summary
Total mutants checked: 2
Killed (tests caught them): 0
Survived (gaps in coverage): 2

=== Files with most coverage gaps ===
    1 inventory.services.role_config_service
    1 inventory.db.gateway

From 15 surviving mutants down to 2. From 97.9% to 99.7% mutation score.

The coverage numbers told a similar story:

Coverage improvements:
- database_gateway.py: 92% → 100%
- teams_api.py: 85% → 100%
- role_config_service.py: 86% → 100%
- employees_api.py: 95% → 100%
- Overall: 93% → 99%
- Total tests: 203 passing

The Shift in Economics

Here's what struck me about this experience: the effort-to-value ratio had completely flipped.

Before AI, mutation testing was something you did if:

  • You had a critical system where bugs were expensive
  • You had a mature team with time to invest
  • You were willing to spend days or weeks on it
  • The application was large enough to justify the investment

For a 650-line internal tool? Forget about it. The math never worked out.

Now? The math is different. The AI did all the analysis work. I just had to review and approve. What used to take days took hours. And most of that time was me deciding priorities, not grinding through mutations.

The barrier to rigorous testing has dropped dramatically. And it doesn't matter if your codebase is 650 lines or 650,000. The cost per mutant is the same.

The Question That Remains

I've worked in teams that maintained sustainable codebases for years. I know what that forest looks like (to use Kent Beck's metaphor). I also know how much discipline, effort, and investment it took to stay there.

Now I'm seeing that same level of quality becoming accessible at a fraction of the cost. Tests that used to require days of manual work can be generated in hours. Mutation testing that was prohibitively expensive is now just another quick pass.

The technical barrier is gone.

So here's the question I'm left with: now that mutation testing costs almost nothing, will we actually use it? Will teams that never had the resources to invest in this level of testing quality start doing it?

Or will we find new excuses?

Because the old excuse ("we don't have time for that level of rigor") doesn't really work anymore. The time cost has collapsed. The tooling is there. The AI can do the heavy lifting.

What's left is just deciding to do it. And knowing that it's worth it.

What I Learned

Three concrete takeaways from this experience:

1. Line coverage lies, even in small codebases: 93% coverage looked great until mutation testing showed me the gaps. Those 15 surviving mutants were in critical error handling paths. After fixing them, I still had 99% line coverage. But now the tests actually verified what they claimed to test. If a 650-line application had 15 significant gaps, imagine larger systems.

2. AI makes rigor accessible for any project size: What used to be prohibitively expensive (manual mutation analysis) is now quick and almost frictionless. The economics have changed. From 15 survivors to 2 in just a few hours of work, most of it done by AI. This level of rigor is no longer reserved for critical systems. It's accessible for small internal tools too.

3. 99.7% is good enough: After the cleanup, I'm left with 2 surviving mutants out of 726. Could I hunt them down? Sure. Is it worth it? Probably not. They're edge cases in utility code that's already well-tested. The point isn't perfection. It's knowing where your gaps are and making informed decisions about them.

The real win isn't the numbers. It's the confidence. I now know exactly which 2 mutants survive and why. That's very different from having 93% coverage and hoping it's good enough.

This was a small project. If it had been bigger, I probably would have skipped mutation testing entirely (too expensive, too time-consuming). But now? Now I can't think of a good reason not to do it. Not when it costs almost nothing and reveals so much.

I used to think mutation testing was for perfectionists and critical systems only. Now I think it should be standard practice for any codebase you plan to maintain for more than a few months.

Not because it's perfect. But because it's no longer expensive.

And when the cost drops to almost zero, the excuses should too.

The AI Prompt That Worked

When facing surviving mutants, this single prompt did most of the heavy lifting:

"Run mutation testing with make test-mutation. For each surviving mutant, use make test-mutation-show MUTANT=name to see the details. Analyze what test case is missing and create tests to kill these mutants, following the existing test style. After adding tests, run make test-mutation again to verify they're killed. Focus on the top 5-10 most critical gaps first: business logic, error handling, and edge cases in services and repositories."

The key: let the AI drive the mutation analysis loop while you focus on reviewing and prioritizing.

Getting Started

If you want to try this:

  1. Add mutmut to your project (5 minutes with AI help)
  2. Create simple Makefile targets to make it accessible for everyone:
    • make test-mutation - Run the full suite
    • make test-mutation-summary - Get the overview
    • make test-mutation-report - See which mutants survived
    • make test-mutation-show MUTANT=name - Investigate specific cases
    • make test-mutation-clean - Reset when needed
  3. Run it weekly, not on every commit (mutation testing is slow)
  4. Use AI to triage survivors (ask it to analyze and prioritize)
  5. Review the top 5-10 gaps as a pair, decide which matter
  6. Start with one critical module, not the whole codebase

Making it easy to run is as important as setting it up. The barrier is gone. What's stopping you?

When NOT to chase 100%: Those final 2 surviving mutants? They're in logging and configuration defaults that are battle-tested in production. Perfect mutation score isn't the goal. Knowing your gaps is. Focus on business logic and error handling first. Skip trivial code.


About This Project

This application was developed using TDD and AI-assisted development with Claude Code and Cursor (using Sonnet 4.5, GPT-5 Codex, and Composer1). The mutation testing setup and gap analysis were done with Claude's help using mutmut.

Timeline: The entire mutation testing setup and gap analysis took about 2-3 hours with AI assistance.

Final stats: 649 statements, 208 tests, 99% line coverage, 726 mutants tested, 724 killed (99.7% mutation score).
