Surely at this point everyone has heard the story where Replit deleted a production database, overriding its rules and instructions along the way, and another incident, where the Gemini CLI deleted a load of code, followed right behind. These get held up as examples of why AI, because it makes mistakes, is dangerous and shouldn't be used. Definitely pay close attention to failures, but then dig in and lean back on our engineering chops, because the AI tooling is really incidental in these examples.
To a decent degree, we are just falling for our human tendency to be distracted by significant negative events; we're working against evolution on this, so I get the reaction. What I'm really looking for is for our rational side to take over and work the problem.
If your delegate goes off the rails and does a bunch of destructive things, you have delegated poorly, whether you understood that or not
There are two lessons that stand out to me to take forward: solid engineering practices remain incredibly important, and what will really matter is the rate at which failures are introduced into our systems.
Guardrails and incident management practices are an obvious set of tools for reducing and managing the risk and impact of bad things happening. Note I said reduce, not eliminate. That has always been core to this school of thought: zero risk, even if achievable, is likely cost prohibitive, so we approach this by managing the risk and deciding what guardrails we need in place to do that.
Human engineers have been happily killing production systems for as long as we've had them (so many examples come to mind, both personal and legendary), and it would have been a really weird response to just stop writing code as the only way to prevent those issues. Instead, over time we built up incident response skills, and then incident management processes that fed into guardrail improvements; this is how we reduce impact and risk.
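To make that concrete, here is a minimal sketch of what one such guardrail could look like: a gate that an agent's proposed shell commands must pass before they run. The patterns, the function names, and the confirmation prompt are all hypothetical illustrations, not any particular tool's API.

```python
import re

# Hypothetical denylist of destructive command patterns; a real guardrail
# would be tuned to your own systems and delegate tooling.
DESTRUCTIVE_PATTERNS = [
    r"\brm\s+-rf\b",                  # recursive filesystem deletes
    r"\bdrop\s+(table|database)\b",   # destructive SQL
    r"\btruncate\s+table\b",
]

def is_destructive(command: str) -> bool:
    """Return True if the command matches a known destructive pattern."""
    return any(re.search(p, command, re.IGNORECASE) for p in DESTRUCTIVE_PATTERNS)

def gate(command: str) -> bool:
    """Let safe commands through; destructive ones require a human yes."""
    if not is_destructive(command):
        return True
    answer = input(f"Agent wants to run {command!r}. Allow? [y/N] ")
    return answer.strip().lower() == "y"

if __name__ == "__main__":
    # Non-interactive demo: classify a few commands instead of prompting.
    for cmd in ["ls -la", "rm -rf /var/data", "DROP TABLE users;"]:
        verdict = "needs human sign-off" if is_destructive(cmd) else "allowed"
        print(f"{cmd!r}: {verdict}")
```

The point is not the specific patterns; it is that the delegate's actions pass through a checkpoint you control, and that checkpoint gets smarter after every incident.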
The part that came later was the codifying of the four DORA metrics as a core view for tracking the throughput and quality of the changes we make. There are certainly other metric sets you can build as your feedback loop, but you should definitely have something.
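As a sketch of what that feedback loop can look like, here is a minimal example that computes two of the DORA metrics, change failure rate and mean time to restore, from deployment records. The Deployment shape and the sample data are assumptions for illustration; in practice these numbers would be fed from your deploy and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Deployment:
    deployed_at: datetime
    caused_failure: bool = False
    restored_at: datetime | None = None  # set once a failure was resolved

def change_failure_rate(deploys: list[Deployment]) -> float:
    """Fraction of deployments that led to a failure in production."""
    return sum(d.caused_failure for d in deploys) / len(deploys)

def mean_time_to_restore(deploys: list[Deployment]) -> timedelta:
    """Average time from a failing deployment to service restoration."""
    durations = [
        d.restored_at - d.deployed_at
        for d in deploys
        if d.caused_failure and d.restored_at is not None
    ]
    return sum(durations, timedelta()) / len(durations)

if __name__ == "__main__":
    deploys = [
        Deployment(datetime(2024, 6, 1, 9)),
        Deployment(datetime(2024, 6, 2, 14), caused_failure=True,
                   restored_at=datetime(2024, 6, 2, 15, 30)),
        Deployment(datetime(2024, 6, 3, 11)),
        Deployment(datetime(2024, 6, 4, 16)),
    ]
    print(f"Change failure rate: {change_failure_rate(deploys):.0%}")
    print(f"Mean time to restore: {mean_time_to_restore(deploys)}")
```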
What you really must know is whether you are shifting either the rate of failures or your recovery time from them. That is true whether or not you are in the midst of an AI push. Some early work has shown that increases in throughput and change size from using AI tools have negatively impacted change failure rates. That's actually a predictable outcome of this model of delivery, and it might be OK. What you really want to think about is whether the shift in your risk profile is acceptable for the increased value you are getting from shipping more; that's a tradeoff decision for you and your team.
Finally, as a human engineer, do not blame the AI tool, and don't tolerate anyone else blaming it either. These are tools; you should learn how to wield them, risks and all. But it's more important than that. If we as engineers are to remain involved (and so employed), then the tools are our delegates. That is how we should be approaching this in every way, and delegation is a concept with a long legal basis: treat the choices as such.
So if your delegate goes off the rails and does a bunch of destructive things, you have delegated poorly. Whether you understood that or not, you still blew that play. That is your signal to dig in, learn, and get some new guardrails in place.