For weeks, I had been carrying this nagging doubt. The kind of doubt that's easy to ignore when everything is working. My inventory application had 93% test coverage, all tests green, type checking passing. The code had been built with TDD from day one, using AI-assisted development with Claude and Cursor (with Sonnet 4.5, GPT-4o, and Claude Composer), what I like to call "vibecoding". Everything looked solid.
It's not a big application. About 650 lines of production code. 203 tests. A small internal tool for tracking teams and employees. The kind of project where you might think "good enough" is actually good enough.
But something was bothering me.
I had heard about mutation testing years ago. I even tried it once or twice. But let's be honest: it always felt like overkill. The setup was annoying, the output was overwhelming, and the juice rarely seemed worth the squeeze. You had to be really committed to quality (or really paranoid) to go through with it.
This time, though, with AI doing the heavy lifting, I decided to give it another shot.
The First Run: 726 Mutants
I added mutmut to the project and configured it with AI's help. Literally minutes of work. Then I ran it:
$ make test-mutation
Running mutation testing
726/726  🎉 711  ⏰ 0  🤔 0  🙁 15  🔇 0
33.50 mutations/second
Not bad. 711 mutants killed out of 726. That's 97.9% mutation score. I felt pretty good about it.
Until I looked at those 15 survivors.
The 15 Survivors
I ran the summary command to see what had survived:
$ make test-mutation-summary
Total mutants checked: 15
Killed (tests caught them): 0
Survived (gaps in coverage): 15
=== Files with most coverage gaps ===
5 inventory.services.role_config_service
4 inventory.services.orgportal_sync_service
2 inventory.infrastructure.repositories.initiative
1 main.x_create_application__mutmut_6: survived
1 inventory.services.orgportal_sync_service.x_poll_for_updates__mutmut_6: survived
1 inventory.db.gateway
1 inventory.app_setup.x_include_application_routes__mutmut_33: survived
There they were. Fifteen little gaps in my test coverage. Fifteen cases where my tests weren't as good as I thought.
And remember: this is a 650-line application with 203 tests. If I found 15 significant gaps here, what would I find in a 10,000-line system? Or 100,000?
The thing is, a few months ago, this would have been the end of the story. I would have looked at those 15 surviving mutants, felt slightly guilty, and moved on. The effort to manually analyze each mutation, understand what it meant, and write the specific tests to kill it would have taken days. Maybe a week.
Not worth it for a small internal tool.
But this time was different.
What the Mutants Revealed
Before jumping into fixes, I wanted to understand what these surviving mutants were actually telling me. With AI's help, I analyzed them systematically.
Here's what we found:
In role_config_service (5 survivors):
The service loaded YAML configuration for styling team roles. My tests verified that the service loaded the config and returned the right structure. But they never checked what happened when:
- The YAML file was missing
- The YAML was malformed
- Required fields were absent
The code had error handling for all these cases. My tests didn't verify any of it.
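Here's a minimal sketch of the kind of test that kills these mutants. The class name (RoleConfigService), its constructor, and the fallback behaviour are illustrative assumptions on my part; only the module name is the real one from the project.

import pytest

from inventory.services.role_config_service import RoleConfigService  # class name assumed

def test_missing_config_file_falls_back_to_defaults(tmp_path):
    # Point the service at a file that doesn't exist and make sure the
    # error-handling branch actually runs (and returns something sane).
    service = RoleConfigService(config_path=tmp_path / "missing.yaml")
    assert service.load() == {}

def test_malformed_yaml_is_rejected(tmp_path):
    config_file = tmp_path / "roles.yaml"
    config_file.write_text("roles: [unclosed")  # invalid YAML on purpose
    service = RoleConfigService(config_path=config_file)
    with pytest.raises(ValueError):
        service.load()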
In orgportal_sync_service (4 survivors):
This service synced data from S3. Tests covered the happy path: download file, process it, done. But mutants survived when we:
- Changed log messages (I wasn't verifying logs)
- Skipped metadata checks (last_modified, content_length)
- Removed directory existence checks
The code was defensive. My tests assumed everything would go right.
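Same idea for the sync service: stub out S3, take away the things the code defends against, and assert that the defensive branch actually fires. Again, the class name (OrgPortalSyncService) and its constructor are assumptions for illustration, not the real API.

from unittest.mock import MagicMock

from inventory.services.orgportal_sync_service import OrgPortalSyncService  # class name assumed

def test_sync_skips_download_when_metadata_is_missing(tmp_path):
    s3_client = MagicMock()
    s3_client.head_object.return_value = {}  # no last_modified, no content_length
    service = OrgPortalSyncService(s3_client=s3_client, target_dir=tmp_path / "data")
    service.sync()
    # The defensive branch should bail out before downloading anything.
    s3_client.download_file.assert_not_called()

def test_sync_creates_target_directory_when_missing(tmp_path):
    s3_client = MagicMock()
    s3_client.head_object.return_value = {"ContentLength": 42, "LastModified": "2024-01-01"}
    target = tmp_path / "not_there_yet"
    OrgPortalSyncService(s3_client=s3_client, target_dir=target).sync()
    assert target.exists()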
In database and infrastructure layers (6 survivors):
Similar story. Error paths that existed in production but were never exercised in tests:
- SQLite connection failures
- Invalid data in from_db_row factories
- 404 responses in API endpoints
Classic case of "it works, so I'm not testing the error cases."
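The factory gap, for example, closes with a test as small as this. The entity class and the exception type are illustrative; the from_db_row pattern is the real one.

import pytest

from inventory.domain.models import Initiative  # import path and class name assumed

def test_from_db_row_rejects_row_without_id():
    bad_row = {"id": None, "name": "Platform Migration"}
    # Validation that never runs in tests is exactly the kind of code mutmut flags.
    with pytest.raises(ValueError):
        Initiative.from_db_row(bad_row)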
The pattern was clear: I had good coverage of normal flows, but my tests were optimistic. They assumed the happy path and left the defensive code untested.
This is what deferred quality looks like at the micro level. Like Deming's red bead experiment (where defects came from the system, not the workers), these weren't random failures. They were systematic gaps in how I verified the system. Every surviving mutant is a potential bug waiting in production, interrupting flow when it surfaces weeks later. The resource efficiency trap: "we already have 93% coverage" feels cheaper than spending 2-3 hours... until you spend days debugging a production issue that a proper test would have caught.
The AI-Powered Cleanup
But this time I had AI. So I did something different.
I asked Claude to analyze the surviving mutants one by one, understand what edge cases they represented, and create or modify tests to cover them. I just provided some guidance on priorities and made sure the new tests followed the existing style.
(The app itself had been built using a mix of tools: Claude for planning and architecture, Cursor with different models for implementation. But for this systematic mutation analysis, Claude's reasoning capabilities were particularly useful.)
In about two or three hours, we had addressed all the key gaps:
- SQLite error handling: I thought I was testing error paths, but I was only testing the happy path. Added proper error injection tests.
- Factory method validation: My from_db_row factories had validation that was never triggered in tests. Added tests with invalid data.
- Edge cases in services: Empty results, missing metadata, nonexistent directories. All cases my code handled but my tests never verified.
- 404 handling in APIs: The code worked, but no test actually verified the 404 response.
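That last one is almost embarrassingly small once you write it. Something along these lines, shown here with FastAPI's TestClient as an example (adapt the client and the route to whatever framework and paths you actually have):

from fastapi.testclient import TestClient  # FastAPI used as an example framework

from main import create_application  # the app factory the mutant names point at

def test_unknown_team_returns_404():
    client = TestClient(create_application())
    response = client.get("/teams/does-not-exist")  # route path is illustrative
    assert response.status_code == 404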
The result after several iterations:
$ make test-mutation
Running mutation testing
726/726  🎉 724  ⏰ 0  🤔 0  🙁 2  🔇 0
30.02 mutations/second
$ make test-mutation-summary
Total mutants checked: 2
Killed (tests caught them): 0
Survived (gaps in coverage): 2
=== Files with most coverage gaps ===
1 inventory.services.role_config_service
1 inventory.db.gateway
From 15 surviving mutants down to 2. From 97.9% to 99.7% mutation score.
The coverage numbers told a similar story:
Coverage improvements:
- database_gateway.py: 92% → 100%
- teams_api.py: 85% → 100%
- role_config_service.py: 86% → 100%
- employees_api.py: 95% → 100%
- Overall: 93% → 99%
- Total tests: 203 passing
The Shift in Economics
Here's what struck me about this experience: the effort-to-value ratio had completely flipped.
Before AI, mutation testing was something you did if:
- You had a critical system where bugs were expensive
- You had a mature team with time to invest
- You were willing to spend days or weeks on it
- The application was large enough to justify the investment
For a 650-line internal tool? Forget about it. The math never worked out.
Now? The math is different. The AI did all the analysis work. I just had to review and approve. What used to take days took hours. And most of that time was me deciding priorities, not grinding through mutations.
The barrier to rigorous testing has dropped dramatically. And it doesn't matter if your codebase is 650 lines or 650,000. The cost per mutant is the same.
The Question That Remains
I've worked in teams that maintained sustainable codebases for years. I know what that forest looks like (to use Kent Beck's metaphor). I also know how much discipline, effort, and investment it took to stay there.
Now I'm seeing that same level of quality becoming accessible at a fraction of the cost. Tests that used to require days of manual work can be generated in hours. Mutation testing that was prohibitively expensive is now just another quick pass.
The technical barrier is gone.
So here's the question I'm left with: now that mutation testing costs almost nothing, will we actually use it? Will teams that never had the resources to invest in this level of testing quality start doing it?
Or will we find new excuses?
Because the old excuse ("we don't have time for that level of rigor") doesn't really work anymore. The time cost has collapsed. The tooling is there. The AI can do the heavy lifting.
What's left is just deciding to do it. And knowing that it's worth it.
What I Learned
Three concrete takeaways from this experience:
1. Line coverage lies, even in small codebases: 93% coverage looked great until mutation testing showed me the gaps. Those 15 surviving mutants were in critical error handling paths. After fixing them, I still had 99% line coverage. But now the tests actually verified what they claimed to test. If a 650-line application had 15 significant gaps, imagine larger systems.
2. AI makes rigor accessible for any project size: What used to be prohibitively expensive (manual mutation analysis) is now quick and almost frictionless. The economics have changed. From 15 survivors to 2 in just a few hours of work, most of it done by AI. This level of rigor is no longer reserved for critical systems. It's accessible for small internal tools too.
3. 99.7% is good enough: After the cleanup, I'm left with 2 surviving mutants out of 726. Could I hunt them down? Sure. Is it worth it? Probably not. They're edge cases in utility code that's already well-tested. The point isn't perfection. It's knowing where your gaps are and making informed decisions about them.
The real win isn't the numbers. It's the confidence. I now know exactly which 2 mutants survive and why. That's very different from having 93% coverage and hoping it's good enough.
This was a small project. If it had been bigger, I probably would have skipped mutation testing entirely (too expensive, too time-consuming). But now? Now I can't think of a good reason not to do it. Not when it costs almost nothing and reveals so much.
I used to think mutation testing was for perfectionists and critical systems only. Now I think it should be standard practice for any codebase you plan to maintain for more than a few months.
Not because it's perfect. But because it's no longer expensive.
And when the cost drops to almost zero, the excuses should too.
The AI Prompt That Worked
When facing surviving mutants, this single prompt did most of the heavy lifting:
"Run mutation testing withmake test-mutation. For each surviving mutant, usemake test-mutation-show MUTANT=nameto see the details. Analyze what test case is missing and create tests to kill these mutants, following the existing test style. After adding tests, runmake test-mutationagain to verify they're killed. Focus on the top 5-10 most critical gaps first: business logic, error handling, and edge cases in services and repositories."
The key: let the AI drive the mutation analysis loop while you focus on reviewing and prioritizing.
Getting Started
If you want to try this:
- Add mutmut to your project (5 minutes with AI help)
- Create simple Makefile targets to make it accessible for everyone:
  - make test-mutation - Run the full suite
  - make test-mutation-summary - Get the overview
  - make test-mutation-report - See which mutants survived
  - make test-mutation-show MUTANT=name - Investigate specific cases
  - make test-mutation-clean - Reset when needed
- Run it weekly, not on every commit (mutation testing is slow)
- Use AI to triage survivors (ask it to analyze and prioritize)
- Review the top 5-10 gaps as a pair, decide which matter
- Start with one critical module, not the whole codebase
Making it easy to run is as important as setting it up. The barrier is gone. What's stopping you?
When NOT to chase 100%: Those final 2 surviving mutants? They're in logging and configuration defaults that are battle-tested in production. Perfect mutation score isn't the goal. Knowing your gaps is. Focus on business logic and error handling first. Skip trivial code.
About This Project
This application was developed using TDD and AI-assisted development with Claude Code and Cursor (using Sonnet 4.5, GPT-5 Codex, and Composer 1). The mutation testing setup and gap analysis were done with Claude's help using mutmut.
Timeline: The entire mutation testing setup and gap analysis took about 2-3 hours with AI assistance.
Final stats: 649 statements, 208 tests, 99% line coverage, 726 mutants tested, 724 killed (99.7% mutation score).
Related Reading
- When AI Makes Good Practices Almost Free - How AI is changing the economics of software quality practices
- Deming's Red Bead Experiment (video) - Understanding systematic vs random variation in processes