AI/Copilot and deterministic code refactoring

I tested whether AI can accurately refactor code. Spoiler: It can’t! But I had some fun and did some neat meta-programming along the way.

A code style thing

At work this week I did a possibly-controversial refactor.

There was a PHP array that was being converted into constants in a format like this:

$wizards = [
  'hermione' => 'Gryffindor',
  'harry' => 'Gryffindor',
  'luna' => 'Ravenclaw',
  'draco' => 'Slytherin',
];

foreach ($wizards as $name => $house) {
  if (! defined($name . 'House')) {
    define($name . 'House', $house);
  }
}

All perfectly sensible stuff. Until you realise that you lack some developer niceties:

You don’t get autocomplete for the wizardHouse constants.
You don’t get static analysis telling you a constant doesn’t exist.
You can make errors easily and without knowing!!

So I wanted to take the hit of more verbose code for the benefits of individual declarations:

/**
 * Hermione is smart, and organised. But oh my goodness she can be annoyingly self-righteous at times!
 */
if (! defined('hermioneHouse')) {
  define('hermioneHouse', 'Gryffindor');
}

You might disagree that this is better. That’s fine. This post is not about this. I’m just setting the scene.

Can AI refactor this?

I had about thirty of these constants to rework. A really boring job. But my brain goes immediately to “what’s the most efficient way?” and I considered:

Search/replace with regex
Multi-cursor editing
Writing some “meta-programming” code – a script that converts the array into the new code that I want.
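As a rough idea of what that third option could look like, here is a minimal sketch that turns the original array into one guarded define() block per wizard. The function name generateHouseConstants is made up for illustration, and var_export() is used here only as a convenient way to get correctly quoted string literals.

```php
<?php
// Hypothetical helper: turns the name => house array into one guarded
// define() block per wizard, returned as a string of PHP source code.
function generateHouseConstants(array $wizards): string
{
    $output = '';
    foreach ($wizards as $name => $house) {
        // var_export() returns a quoted and escaped PHP string literal.
        $constant = var_export($name . 'House', true);
        $value = var_export($house, true);

        $output .= 'if (! defined(' . $constant . ")) {\n";
        $output .= '  define(' . $constant . ', ' . $value . ");\n";
        $output .= "}\n\n";
    }
    return $output;
}

$wizards = [
  'hermione' => 'Gryffindor',
  'harry' => 'Gryffindor',
  'luna' => 'Ravenclaw',
  'draco' => 'Slytherin',
];

// Print the generated source so it can be pasted into the real file.
echo generateHouseConstants($wizards);
```

The doc comments above each constant still have to be written by hand, since nothing in the array describes them.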

But then I thought: We have GitHub Copilot at work, so I wondered if AI could help me here.

And it can! Cool! That was fast too!!

Can AI refactor this correctly?

But… AI is non-deterministic, right?

I went looking around the docs and the refactoring page says, over and over again:

As with all Copilot suggestions, you should always check that the revised code runs without errors and produces the correct result.

In the page of example refactoring prompts it is explicit about non-determinism:

The responses described in this article are examples. Copilot Chat responses are non-deterministic, so you may get different responses from the ones shown here.

(“Non-deterministic” means that you don’t always get the same output for any given input.)

So, while the refactoring that Copilot did looks correct, is it actually correct?

For the job at work, I manually checked it all, and it had done the job perfectly, and saved me a bunch of boring typing too!

But I was left with the question: Is a refactor like this always going to be correct using generative AI like Copilot? Can it do more complex stuff? And if not, at what point does it start to make stuff up?

Breaking Copilot

To find out, I took the Harry Potter example and made it more and more complicated until it broke. Here’s what I ended up with:

class Wizard {
    public $name;
    public $house;
    public $nameLength = 0;
    public $firstStreet = '';
    public $firstPet = '';
    public $rockStarName = '';
    public $rockStarNameLength = 0;

    public function __construct(
        $name,
        $house,
        $nameLength,
        $firstStreet,
        $firstPet,
        $rockStarName,
        $rockStarNameLength
    ) {
        $this->name = $name;
        $this->house = $house;
        $this->nameLength = $nameLength;
        $this->firstStreet = $firstStreet;
        $this->firstPet = $firstPet;
        $this->rockStarName = $rockStarName;
        $this->rockStarNameLength = $rockStarNameLength;
    }
}

$wizards = [
    [
        'name' => 'Hermione',
        'house' => 'Gryffindor',
        'firstStreet' => 'Main',
        'firstPet' => 'Crookshanks',
    ],
    // etc...
];

Two things about this:

This is a contrived example. You wouldn’t pass the name lengths in – you would calculate them. I know that. Just go with it though.
Copilot was really helpful when generating the sample data. It knows a lot about Harry Potter!

The scenario here is that I want to convert the items in the array into new Wizard() constructors with raw data. Again, yes, contrived.

I prompted with:

Convert this array into a series of object constructors like:

$hermione = new Wizard('Hermione', 'Gryffindor', 8, 'Main', 'Crookshanks', 'Main Crookshanks', 16);

And what I got back was very interesting. Very nearly, teasingly, deceptively complete and correct code:

<?php
$hermione = new Wizard('Hermione', 'Gryffindor', 8, 'Main', 'Crookshanks', 'Main Crookshanks', 16);
$harry = new Wizard('Harry', 'Gryffindor', 8, 'Privet', 'Hedwig', 'Privet Hedwig', 12);
$luna = new Wizard('Luna', 'Ravenclaw', 9, 'Moon', 'Mr. Whiskers', 'Moon Mr. Whiskers', 15);
$draco = new Wizard('Draco', 'Slytherin', 8, 'Snake', 'Nagini', 'Snake Nagini', 11);

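BUT! Those final numbers are wrong on every line except the Hermione one, which is copied straight from my prompt. They ignore the spaces in the names!! And the name lengths for Harry, Luna and Draco are wrong too.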

I broke Copilot’s refactoring. Sure, with a contrived example, but I broke it.

To be fair, Copilot tells me:

Note: The numbers 8, 16, 12, 9, 15, and 11 are placeholders. You should replace them with the appropriate values based on your requirements.

So it KNOWS that I broke it. But I actually missed that message the first time I did this. It’s very subtle and easy to miss when your focus is on the code. You could easily accept this refactor into your code and think it was fine.
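One way to check numbers like these without eyeballing them is to recalculate them. The sketch below is only an illustration: the helper name findWrongLengths is made up, and the rows are the streets, pets and final numbers copied from the Copilot output above.

```php
<?php
// Values copied from the Copilot output above:
// [first street, first pet, rock star name length that Copilot gave].
$rows = [
    'hermione' => ['Main', 'Crookshanks', 16],
    'harry'    => ['Privet', 'Hedwig', 12],
    'luna'     => ['Moon', 'Mr. Whiskers', 15],
    'draco'    => ['Snake', 'Nagini', 11],
];

// Hypothetical helper: returns only the rows whose supplied length is wrong.
function findWrongLengths(array $rows): array
{
    $wrong = [];
    foreach ($rows as $name => [$street, $pet, $given]) {
        // The rock star name is "street pet", so the joining space counts.
        $expected = strlen($street . ' ' . $pet);
        if ($expected !== $given) {
            $wrong[$name] = ['given' => $given, 'expected' => $expected];
        }
    }
    return $wrong;
}

print_r(findWrongLengths($rows));
```

This flags harry, luna and draco (expected 13, 17 and 12), while the hermione row passes.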

What is Copilot doing?

The issue here is trust. What is Copilot doing? How did it do my bulk-refactoring? Can I see how it did it?

At what point does it break? At what point is it deterministic, and at what point is it making stuff up? Does it always know when it’s messed up?

These LLMs are black boxes. I don’t know what they’re doing to generate the modified code. And the warnings are clear and plentiful: This thing isn’t always right!

This unknown factor makes it untrustworthy for these kinds of jobs beyond trivial scale or without manual review. And that makes it useless for the larger tasks, where it would otherwise help the most.

Another way: Meta-programming!

Along the way in this journey I had a realisation. One of my options for doing this refactor was meta-programming. Writing some PHP that took the array and generated new PHP code as output that did what I wanted.

Insert exploding head emoji.

But then I realised… Copilot is actually good at writing simple code. So can Copilot do the meta-programming?

For my example here, I prompted it with:

Write a PHP function that takes in this array as input and has as output new PHP code for each array entry that will construct a new Wizard object like:

$hermione = new Wizard('Hermione', 'Gryffindor', 8, 'Main', 'Crookshanks', 'Main Crookshanks', 16);

Now, this also very nearly, teasingly worked. Here’s what it gave me:

<?php
function generateWizardConstructors($wizards) {
    $output = '';
    foreach ($wizards as $wizard) {
        $name = $wizard['name'];
        $house = $wizard['house'];
        $firstStreet = $wizard['firstStreet'];
        $firstPet = $wizard['firstPet'];
        $combined = $firstStreet . ' ' . $firstPet;
        $randomNumber1 = rand(1, 20); // Placeholder for the 3rd parameter
        $randomNumber2 = rand(1, 20); // Placeholder for the 7th parameter

        $variableName = strtolower($name);
        $output .= "\$$variableName = new Wizard('$name', '$house', $randomNumber1, '$firstStreet', '$firstPet', '$combined', $randomNumber2);\n";
    }
    return $output;
}

Those random numbers stand out though, don’t they? Nothing here should be random.

THIS is deterministic code though. I can see what it is doing. And I can see that it is doing it wrong. I can take this and make it what I need easily. I can probably get Copilot to tweak it to be correct.

Here’s some final meta-programming code you can run and see the output of:

https://web.tinkerwell.app/#/snippets/df0aba23-e7b9-4e9e-abaf-95255c137a11
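In case the snippet is not to hand, here is a minimal sketch of how the generator might be corrected. It is not necessarily the same as the linked code: it assumes the two numbers are meant to be the lengths of the name and of the rock star name, and it swaps rand() for strlen() and uses var_export() for quoting.

```php
<?php
// Sketch of a corrected generator: same shape as Copilot's version, but
// both lengths are calculated with strlen() instead of rand().
function generateWizardConstructors(array $wizards): string
{
    $output = '';
    foreach ($wizards as $wizard) {
        $rockStarName = $wizard['firstStreet'] . ' ' . $wizard['firstPet'];
        $args = [
            $wizard['name'],
            $wizard['house'],
            strlen($wizard['name']), // counts bytes; fine for ASCII names
            $wizard['firstStreet'],
            $wizard['firstPet'],
            $rockStarName,
            strlen($rockStarName),   // includes the joining space
        ];
        // var_export() quotes strings and leaves integers bare.
        $argList = implode(', ', array_map(fn($arg) => var_export($arg, true), $args));
        $variableName = strtolower($wizard['name']);
        $output .= '$' . $variableName . ' = new Wizard(' . $argList . ");\n";
    }
    return $output;
}

$wizards = [
    [
        'name' => 'Hermione',
        'house' => 'Gryffindor',
        'firstStreet' => 'Main',
        'firstPet' => 'Crookshanks',
    ],
];

echo generateWizardConstructors($wizards);
```

For the Hermione entry this prints the same line as the example in the prompt, and running it twice gives the same output both times.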

What we learned – cool stuff!!

This was fun and interesting.

Copilot looks like it can do deterministic (reliable, accurate) refactoring, but it can’t. It knows that it can’t. It repeatedly warns you that it can make mistakes. It does not understand your code! It just churns out more of what looks the same. This is actually really dangerous!

BUT… we can get Copilot to help us do some bulk, deterministic programming by asking it to help us write meta-programs that modify code. Sometimes. In some cases.

This was full of contrived examples. Perhaps it’s not actually useful. But perhaps you also learned something.