In a Nutshell
In February I experimented with making Claude more useful to me in building Rails apps. Two months is a long time in AI, so there’ll need to be an update soon, but in the meantime, these are the most helpful things I found:
- Standard out-of-the-box stuff (tests, linting with rubocop, security checks with brakeman) (also Conventional Commits, to make commit messages easier to parse)
- TDD with Lada Kesseler’s TDD skill
- Mutation testing with mutant
- Brainstorming, planning, and reviewing stages with Every’s compound-engineering plugin
- Keeping the code easier to change with CodeScene’s CodeHealth metric
I built voice_assistant (on GitHub and GitLab) with this setup.
(Two days ago, Tanya Janca, who is very good on security, released secure coding rules distilled from her latest book as AI prompts, so if I were restarting today I would also grab those and add the tier 1 prompt to Claude’s main memory.)
In More Detail
Version 0: Where Vibe Coding Is (Unexpectedly) Useful
The product idea: I wanted an app that let me set alerts in a remote location. So if you were looking after a forgetful older relative, say, and they were worried about missing meals, you could tell it to remind them by speaking out loud every day half an hour before breakfast. I hadn’t had time to build it, but there was buzz about how enormously AI coding tools had improved in Q4 2025, so I opened Claude and started vibe coding it.
This ended up as 500+ lines of untested JavaScript in a single file, and was weirdly fun. It was the quickest to develop, and the easiest to experiment in – in subsequent versions I just prompted “use Deepgram for voice recognition and Eleven Labs for voice synthesis” because I’d already tried out third-party libraries in version 0. It still has more functionality than later versions because it was so easy to tell it to add a feature.
From a product point of view and for spiking new features and integrations to see how / whether they’ll work, this is amazing. You’d never want to deploy it, but for playing around with product ideas on a local machine and seeing what worked and what didn’t, this was awesome. (That there’s a time and place where vibe coding makes sense is the discovery I least expected.)
Testing
Earlier in 2025 I spent a hackathon trying to get Cursor to be sure to run the tests after changing code, which it somehow wasn’t doing out of the box. So we built a Cursor rule, and whenever it still failed to run the tests we would ask Cursor to fix its own rule so that it would run them next time, and it produced more and more shouty exhortations like
ABSOLUTELY 🚨 MANDATORILY 🚨 RUN THE TESTS EVERY TIME, NO EXCEPTIONS
… but it would still fail to run the tests.
I knew I wanted reinforcements to ensure the tests ran. I’d enjoyed the Swarmia podcast with Lada Kesseler on her year of hands-on coding with AI, and around the same time Kesseler announced her Claude Code skill factory, which included a TDD skill, so I added that to the mix. And after that the tests ran consistently, which was a pleasant surprise.
(There are other TDD skills: I know a couple of shops that are using obra/superpowers, which includes one. I haven’t done comparisons yet. Thoughtbot passes Martin Fowler’s definition to Claude and tells it to do that. Antony Marcano mentions that he is currently developing a new TDD skill but that meanwhile the Kesseler skill is one of the better ones.)
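Whichever skill drives it, the rhythm it enforces looks the same. A minimal sketch (the ReminderParser class and spec are hypothetical stand-ins, not output from any of these skills): write the failing spec first, then the smallest implementation that goes green.

# spec/reminder_parser_spec.rb (written first, so it fails: red)
require_relative "../lib/reminder_parser"

RSpec.describe ReminderParser do
  it "extracts the delay and message from a spoken command" do
    parsed = ReminderParser.parse("set reminder in 5 minutes to take pills")
    expect(parsed.delay_minutes).to eq(5)
    expect(parsed.message).to eq("take pills")
  end
end

# lib/reminder_parser.rb (the smallest implementation that passes: green)
class ReminderParser
  Result = Struct.new(:delay_minutes, :message, keyword_init: true)

  def self.parse(command)
    match = command.match(/in (\d+) minutes? to (.+)/)
    Result.new(delay_minutes: match[1].to_i, message: match[2])
  end
end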
Mutation Testing
I discovered mutant and mutation testing in an Arkency “Ruby with TDD and DDD” course in 2016. Mutation testing can be seen as an extreme version of code coverage: it changes your implementation code (making a line return true instead of false, say) and then runs the tests and sees if they fail. If they don’t, that proves your tests didn’t really cover that line, and to get the mutation score to 100% you need to add another test, then re-run mutant and make sure the line is really covered. Repeat as required. This gives a more accurate result than simple line coverage, which only verifies that the tests execute the line, not that they check what the line does.
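A minimal sketch of how that plays out (the class, spec, and quiet-hours rule are all hypothetical, and running mutant itself needs the mutant-rspec integration set up for the project): with only the first example below, mutant can mutate the boundary or flip the whole expression to true and no test fails; the second example is the one mutant pushes you to add.

# lib/quiet_hours.rb: no spoken alerts late at night
class QuietHours
  def self.speak_allowed?(hour)
    hour >= 7 && hour < 22
  end
end

# spec/quiet_hours_spec.rb
require_relative "../lib/quiet_hours"

RSpec.describe QuietHours do
  # On its own, this lets mutations of the upper boundary survive:
  it "allows speech during the day" do
    expect(QuietHours.speak_allowed?(12)).to be true
  end

  # The extra example covering the other branch, which kills those mutants:
  it "stays silent late at night" do
    expect(QuietHours.speak_allowed?(22)).to be false
  end
end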
In pre-AI times, near-100% coverage was seen as overkill for parts of the code that were stable or not mission-critical, though you’d still want it where the code was changing frequently or mission-critical because it gives you more confidence that apparently-unrelated changes aren’t breaking things. In AI times, you definitely want 100% mutation testing coverage, because the AI could completely rewrite the implementation at any point, and tests reinforced by mutation testing will be better at catching that.
AI is safer at writing code in type-safe languages because type checking makes it much easier to see if changes break anything; mutation testing helps close that gap for a language like Ruby.
Antony Marcano and Andrzej Krzywda both note that mutation testing is essential in an AI context.
Compound-Engineering
Every’s compound-engineering plugin is sort of like planning mode on steroids with a multi-agent code review built in. I was pleasantly surprised with how well it worked.
First: /ce:brainstorm
Give it a prompt for a feature, say, and it asks some high-level questions. It asks good questions. I imagine doing this step and the next step with both an engineer and a product person would be ideal. After back-and-forth and clarifications, it creates a document summarizing the brainstorming step (first voice_assistant brainstorm).
Next: /ce:plan
It takes the brainstorm document and builds a plan. It asks more surprisingly good questions, getting into lower-level details, but I’d still include a product person at this stage, as it’s the shortest feedback loop if a product question comes up. Again, there’s back-and-forth, clarifications, corrections from the humans after we’ve looked at the draft document, and finally it saves the agreed-on summary document (first voice_assistant plan).
Next: /ce:work
/ce:work takes the plan and executes it, in my case running with TDD, mutation testing, linting, and so forth. In the beginning I watched every line as it scrolled past, but it all seemed reasonable, so I let it get to where it said it was finished and then tried it out.
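For reference, the checks it was running in my setup amount to something like the following (a hypothetical Rakefile sketch; the real wiring lives in the project and plugin configuration, and mutant needs its own project config):

# Rakefile (sketch)
require "rspec/core/rake_task"
require "rubocop/rake_task"

RSpec::Core::RakeTask.new(:spec)
RuboCop::RakeTask.new(:rubocop)

desc "Static security scan"
task :brakeman do
  sh "bundle exec brakeman --quiet"
end

desc "Mutation testing (assumes mutant is configured for the project)"
task :mutant do
  sh "bundle exec mutant run"
end

task default: %i[rubocop spec brakeman mutant]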
Next: Human-in-the-loop
There were things that only showed up watching it run:
- If we didn’t give Eleven Labs a default voice id, it called the service to fetch the account’s default voice id, so we added an environment variable to skip that service call.
- It didn’t understand pluralization out of the box, so something might happen in “1 minutes” (with hindsight, examples of “set timer for 5 minutes” and “set timer for 1 minute” would probably have prevented this).
- There was a whole thing about using quotation marks inside a command to mark off blocks (“set reminder in ‘5 minutes’ to ‘do something’”) where the parsing failed because the voice recognition expected the quotation marks to be spoken.
- It failed to parse “set alarm at 6:49” because 49 is pronounced “forty-nine”, which it took to be “40 9”, and it assumed you meant “6:40” followed by a disregarded “9”. (After the first one like that came up in manual testing, I asked it to handle and test for all the other compound numbers too.)
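For that last, compound-number case, the eventual fix amounted to normalizing the transcript before the time parser sees it. A rough sketch of the idea (the class name and heuristic here are mine, not the generated code):

# Collapse spoken compound numbers so a transcript like
# "set alarm at 6 40 9" becomes "set alarm at 6:49".
class TranscriptNormalizer
  def self.normalize_time(text)
    text.gsub(/\b(\d{1,2})\s+(\d)0\s+(\d)\b/) { "#{$1}:#{$2}#{$3}" }
  end
end

TranscriptNormalizer.normalize_time("set alarm at 6 40 9")
# => "set alarm at 6:49"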
So, a certain amount of confusion: I definitely wouldn’t deploy the results without running the app and doing a manual review step first, but I was surprised by how well it did.
Next: /ce:review
/ce:review spins up a slew of reviewers (some always on: testing, correctness, maintainability; some switched on depending on what the new code is doing: security, performance, api-contract; and some depending on the language: ruby, python, typescript), has them all review the code, and then produces a consolidated list of issues, from P0s to P3s, and gives you the option of reviewing and fixing them.
Next: Human-in-the-loop, again
Another eyeball step after that in case the fixes introduced new unexpected behaviour in the running app.
Optional for learnings from gotchas: /ce:compound
After untangling bugs / gnarlinesses you can run /ce:compound, which produces a document summarizing the problem, how it was fixed, and the generalizable lessons from it.
These documents aren’t automatic (you need to remember to prompt for them), but as a summary of a gotcha for a new team member (or for the AI doing further work in the same area), they look really useful.
Conclusion
I did not expect this to work as well as it did. I was very pleasantly surprised. There were problems which only showed up in the human-in-the-loop steps, mostly to do with unexpected behaviours in third-party libraries, which were then fixed and had specs added. I would still be cautious of doing something so big in one go that the human-in-the-loop step wouldn’t be able to catch things.
Another possible problem: if your company has custom ways it wants code reviewed, and/or the way it wants code built changes over time, this won’t work out-of-the-box. But if (when) that happens, you can make a local copy and build or update company-specific review agents that support the new way of doing things.
CodeHealth
The workflow so far produces reasonable, well-tested code that implements the plan, though we needed a human-in-the-loop to get there. The next challenge is evolvability. Jason Gorman noted “LLMs [are] very good at generating code that they’re pretty bad at modifying later.” This was quantified in the recent SWE-CI benchmark, which examined how agents sustain quality through long-term evolution (with tasks derived from real repos more than six months old and with lengthy commit histories); the results show that most models making changes from that starting point introduce regressions.
We want to guard against that. Adam Tornhill, who has been studying this for a long time (Your Code As a Crime Scene first came out in 2015), has determined that the CodeHealth metric, while calibrated for how easily people can understand code, is also associated with semantic preservation after AI refactoring: a higher score makes it easier for AI tooling to make changes safely as well.
Checking the code after the first feature was added, it had a CodeHealth score of 9.7 (out of 10), so I did further refactoring (reducing the cyclomatic complexity of gnarly methods) to push the score up to 10. For the first pass at the refactoring I gave Claude the CodeHealth warnings and told it to fix them (that can also be automated via MCP). It reduced the complexity but also incorrectly associated two unrelated things, which helped me see, and prompt Claude to implement, a better solution that made the parsing code easier to follow. So, again, keeping the human-in-the-loop produces a better result, but the first automated pass also helped show me what the better solution was. It got part of the way, but not all the way, there on its own.
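To show the kind of change that moves the metric (a made-up before and after, not the actual voice_assistant code): replacing a growing chain of conditionals with a data-driven dispatch drops the cyclomatic complexity without changing behaviour.

class CommandRouter
  # Before: every new command type adds another branch, and the
  # method's cyclomatic complexity grows with the list.
  def handle_with_branches(command)
    if command.start_with?("set reminder")
      "reminder: #{command}"
    elsif command.start_with?("set alarm")
      "alarm: #{command}"
    elsif command.start_with?("set timer")
      "timer: #{command}"
    else
      "unknown: #{command}"
    end
  end

  # After: one code path, however many commands are registered.
  HANDLERS = {
    "set reminder" => ->(c) { "reminder: #{c}" },
    "set alarm"    => ->(c) { "alarm: #{c}" },
    "set timer"    => ->(c) { "timer: #{c}" }
  }.freeze

  def handle(command)
    _, handler = HANDLERS.find { |prefix, _| command.start_with?(prefix) }
    (handler || ->(c) { "unknown: #{c}" }).call(command)
  end
end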
I only added two additional features (second, third), not the dozens implied in the SWE-CI benchmark, but it feels likely – though without running the experiment for six more months I can’t prove it – that pushing the CodeHealth metric back to 10 between features, and including a human-in-the-loop step to see if there’s a further code improvement that the automated systems had not yet found, will make it easier to keep changing the code safely.
Conclusion
This was more helpful than I expected. It let me take a project that I hadn’t had time or space to work on and helped me get something up and running and deployed.
This success comes in large part from the tooling and guardrails I added: running with TDD, mutation testing, compound-engineering, and CodeHealth, this fifth version was the first (aside from the vibe-coded spike) to produce satisfactory results.
Also, keeping the human-in-the-loop to review was essential, both for noticing when third-party libraries weren’t working as expected and for seeing code improvements that the automated systems couldn’t quite reach. I’m not yet ready to let it loop on its own for hours.
But as an augmenting assistant, yes, this works. Much better than I expected.
Outro
Other things I haven’t tried yet but am curious about:
- Russ Miles has just finished a six-part series on Habitat Thinking and created a Claude plugin for giving the system a self-improving environment: next on my list! (I know one shop where the staff engineers regularly go through code review comments and decide what should go into Claude’s main memory; this sounds like a related effort.)
- Tanya Janca’s distilled secure coding rules (just drop in and start using, I’d say)
- Antony Marcano’s in-development TDD skill (not public yet, and initially for Kotlin, but I’m curious)
- brooks-lint (code review tool that judges code against architecture decay risks, which sounds like it would be a useful supplement to the CodeHealth metric)
- I haven’t checked out the rest of the obra/superpowers plugin (the verification before completion skill takes me back to my “just run the tests already” hackathon project…)
Real-time updates from the O’Reilly AI-Assisted Software Delivery event (16 April 2026):
- Vlad Khononov’s (of Balancing Coupling in Software Design) balanced coupling Claude plugin
Sean Miller: The Wandering Coder






