Codurance AI Hackathon

By Isaac Oldwood

Overload, 33(187):11-13, June 2025


This hackathon explored AI-powered software development. Isaac Oldwood shares what he learned from the event.

On Saturday 26th April, I attended an invite-only AI Hackathon at the Codurance headquarters in London. I give a rundown of the motivations behind the event, what happened and what we learned. At the end, I discuss some limitations of the findings as well as further questions to be considered.

Codurance

Codurance is a global software consultancy that helps businesses build better, sustainable technical capability to support growth. The software craftsmanship ethos shaped the company. The goal of Software Craftsmanship is clear: raise the bar in the software industry through professionalism and technical excellence.

I do not work with or for Codurance and have no official affiliation with the company. I was on the invite list as I co-organise the Software Crafters Cambridge monthly tech meetup with their Head of Marketing, Natalie Gray.

At one of our planning sessions, Natalie mentioned the event was happening and asked if it was something I would be interested in. I was keen to attend as there is currently so much hype around LLMs and AI integrated tools. I wanted to see if we could cut through all the hype and noise and really learn some valuable real-world lessons.

Glossary

AI Tools: Any tools powered by AI that developers can use in their job, e.g. Cursor, ChatGPT and many more.
ChatGPT: An LLM created by OpenAI, widely regarded as the first ‘mainstream AI’.
Co-pilot: An AI/LLM tool that is integrated directly into the VSCode editor.
Cursor: A new code editor with AI built in.
GitHub: An online website to store code.
LLM: Large language models are a type of artificial intelligence (AI) program that can recognise and generate text (and code).
VSCode: A code editor.

I use AI and LLM interchangeably throughout this piece as AI used in the context of the hackathon exclusively refers to varying integrations of LLMs.

The aim

The event was advertised as:

AI is transforming software development, but how effective is AI-powered coding in real-world scenarios? Join Codurance’s [...] AI Hackathon to put AI-assisted development to the test!

Online, I see lots of examples of people building web apps ‘entirely’ using AI, but on closer inspection these projects are not generally up to standard. They are usually not well structured, tested or maintainable. The aim of the hackathon was to see how good some of the AI tools available are at writing real-world production-standard code that developers would be proud of.

Format

Timetable
09:30 Arrival and registration
10:30 Kick-off and Challenge 1
12:45 Lunch
13:45 Challenge 2
16:00 Playback and discussion
16:45 Event close
17:00 Pub

I arrived at the Codurance office a bit early due to train times (about 9:15am). Once I was buzzed in, I was met by Matt Belcher and Rowan Lea. Matt is ‘Head of Emerging Technology’ and Rowan is a ‘Software Craftsperson’, both working at Codurance. I was warmly welcomed and given a brief tour of the office as I was the first to arrive.

Over the next 45 minutes, other developers filtered in. It was great to meet everyone! There was a wide range of experience in both the software development industry and using AI tools – I think this played into the whole day very well. Some people were veterans of the industry with 30+ years’ experience but had only briefly used ChatGPT. At the other end, there was a developer who was still quite early in their career but had been following and using AI tools extensively since they first arrived on the scene. This was great as it meant that everyone had something to contribute but also something to learn and improve on!

After some chatting and fuelling (coffee drinking), Matt and Rowan invited everyone into the space we would be working in. They then explained that everyone was to be split into group A and group B. Group A would use AI tools for the first exercise whilst group B would use traditional non-AI methods. In the afternoon this would be swapped around so that everyone got a go! I was assigned to group A so I got to dive right into the AI tools. It was explained that we should get into pairs or threes within our group to tackle the two exercises.

The brief for both exercises can be found on Matt Belcher’s GitHub. You can find each group’s output in the forks of each repository.

Exercise 1

Exercise 1 was revealed as ‘StyleDen’ and asked you to ‘build a minimum viable product (MVP) for their e-commerce website’.

For the first exercise, I paired with a C# developer who had been exploring and learning Python. As I was most comfortable with Python, we decided to work together and knowledge share along the way. Since we were assigned to the AI first group, we had a discussion about the best way to use it, and more importantly the best way to put it through its paces. We decided to try and use it to its full potential and avoid writing a single line of code if possible, i.e. just prompting and guiding it.

At the beginning of the task, all we had was a README which contained all the requirements. The first thing we needed was a plan. As previously mentioned, we wanted to fully utilise AI so we passed the entire README to ChatGPT and asked it to produce a solution to complete the exercise.

The first section it produced was titled ‘Overview’ and it was essentially the parts of the app we would need and suggested technologies for them. It mentioned a frontend built in React, a backend built using Python’s FastAPI and a SQLite Database.

It then laid out a file directory structure to help us visualise how to split out the app. It listed some key API endpoints which we reviewed to make sure all the requirements were met. It was good to see these aligned with how we would have designed them ourselves.

One part of ChatGPT’s output that I was really interested in was a section titled ‘Tech Stack (Quick Justification)’. This section outlined WHY it chose to use the technologies described above. For me this is a really key aspect of using AI. In most of the uses I see of AI, we ask it to complete some task or ask it a question; we very rarely ask the AI to explain (this is a key point I raised later in the day).

The last part it produced was a ‘Plan of Attack (MVP Steps)’. This was really useful as it gave us smaller bite-sized chunks to iterate on as we created our MVP. My only issue with the plan of attack was that ‘Write some unit tests (especially backend)’ was at the bottom of the list. This highlights an issue I have seen repeatedly with AI- (and human-) developed code. Testing is not considered; or if it is, only as an afterthought. As an advocate of Test Driven Development (TDD), this is a real issue for me. I want tests to be written first based on the requirements, then code to be written to pass those tests. Just to reiterate, this is for production code, as was the aim of the day. I understand that usually for a ‘Hackathon’, you are building some form of prototype and it may not be the time or place for TDD.

As we wanted to fully embrace AI, we decided to use the technologies suggested by ChatGPT. This was partly because we had some experience with them, but also because they are widely used. This means that, in theory, the LLMs will have plenty of training data and should produce decent code. That was the theory at least…

To actually start writing the code I used GitHub Co-pilot built into VSCode – with this you can use ‘agent’ mode. This allows you to prompt an ‘AI Agent’ which will then make edits directly in your files. We started at the first step of the ChatGPT plan of attack and asked it to create a SQLite database along with a seed script (to load the CSV into the database). This worked first time and created a file that ran successfully without any tweaks. However, it did not create any tests. To rectify this we discarded the changes and added ‘Using TDD…’ at the start of the prompt. The second attempt created a very similar script whilst also writing some tests.
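
For a flavour of what such a seed script looks like, here is a minimal sketch; the file, table and column names are hypothetical, not the hackathon data, and this is not the generated code itself:

    # seed.py - minimal sketch of a CSV-to-SQLite seed script.
    # The products.csv file and the products table are hypothetical examples.
    import csv
    import sqlite3

    def seed(csv_path: str = "products.csv", db_path: str = "styleden.db") -> None:
        conn = sqlite3.connect(db_path)
        conn.execute(
            "CREATE TABLE IF NOT EXISTS products ("
            " id INTEGER PRIMARY KEY,"
            " name TEXT NOT NULL,"
            " price REAL NOT NULL)"
        )
        # Read the CSV and insert each row into the table.
        with open(csv_path, newline="") as f:
            rows = [(r["id"], r["name"], r["price"]) for r in csv.DictReader(f)]
        conn.executemany("INSERT OR REPLACE INTO products VALUES (?, ?, ?)", rows)
        conn.commit()
        conn.close()

    if __name__ == "__main__":
        seed()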

As a side note, now reflecting on the day, it has been pointed out to me that it is possible that this isn’t really proper TDD. An LLM writing code and tests in one loop/prompt does not force the tests to be written first and then code to be written to satisfy those tests. It is certainly possible that the production code is written first and then the tests are written. It is not clear to the prompter. Perhaps a better process would be using the LLM to write the unit tests first in one prompt, verifying the tests, and then using another prompt to write the production code to satisfy those tests.

The second step was creating a boilerplate FastAPI app. I used prompts such as ‘Create a boilerplate FastAPI app using TDD’, and this created a very basic app using the FastAPI framework.
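
As a rough illustration of the shape of that output, here is a minimal sketch of a test-plus-boilerplate pair; the /health endpoint is a hypothetical example, not the exact generated code:

    # test_main.py - the test written alongside (ideally before) the code
    from fastapi.testclient import TestClient

    from main import app

    def test_health_check():
        response = TestClient(app).get("/health")
        assert response.status_code == 200
        assert response.json() == {"status": "ok"}

    # main.py - just enough FastAPI code to make the test above pass
    from fastapi import FastAPI

    app = FastAPI(title="StyleDen API")

    @app.get("/health")
    def health_check():
        return {"status": "ok"}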

Another thing that we explored is documentation writing. If we were writing real production code, this app would need to be worked on by other developers that may not have experience with writing/running these APIs. So after getting some working code we asked Co-pilot to ‘Add local setup and running steps to the README’. The documentation produced was easy to follow and contained all the necessary steps to get the app up and running locally.

The rest of the first session followed in this flow. After a basic API was created we moved onto the frontend. Neither my partner nor I had extensive experience with React (though I am trying to learn a bit more). The first thing Co-pilot did was ask to run create-react-app. I was surprised that it was capable of using the terminal directly. To clarify, it does ask your permission before running every command with a simple ‘Continue’ button. I do worry that people may just click ‘Continue’ without fully understanding the commands being run, which could become a security concern.

My part of the exercise was to create the cart page. I prompted Co-pilot to create a new cart page with tests. I asked it to add some basic functionality; for example, allowing the user to increase/decrease the item count in the cart, and removing an item from the cart once its count reached zero. After some manual testing of the app, I discovered that once I removed the last item from the cart the table was still shown, just empty. This was bad UX in my opinion. I was pleasantly surprised by how easy this was to improve by prompting Co-pilot: ‘Currently when no items are left in the cart nothing happens, update this code and tests to display a message such as “No items in cart”’. It updated the code and tests in a straightforward way and in very little time.

By this point, we were running out of time. I wanted to add a couple of finishing touches and asked the AI to add some images and a dynamic total at the bottom of the table. You can see the code’s final state on my GitHub along with all the local running instructions. All the code and documentation was written entirely by AI tools; my partner and I edited no code manually. To summarise, I was very impressed with how quickly we got a working app up and running with very little intervention from us humans.

Lunch

Lunch was provided by Codurance and gave us all some well-earned time to reflect. Of course, Exercise 1 dominated the topic of discussion. There was lots of chatting between pairs within group A about what tools were used, what prompts worked well and other tips and tricks. There were also lots of discussions between group A and B about varying aspects of the task. The key takeaways were:

  • Group A got further in the exercise (a more complete solution with more features) than Group B

    This was clearly due to using AI tools, which allowed them to work faster

  • Co-pilot and ChatGPT were widely chosen AI tools

    It seemed this was due to familiarity, and to Co-pilot being built into VSCode, which was most developers’ editor of choice

  • The AIs did not write unit tests unless specifically asked, but when prompted they did write them, mostly to an acceptable standard

Exercise 2

The second exercise was revealed as ‘StreamStack’. Essentially, build a movie reviewing website. For this task we decided to mix up the pairs, which allowed new ideas and networking. I ended up forming a three with two other developers who were happy to use Python. We knew we would have no AI help for this exercise so we needed to stick to tools and technology we had experience with.

We started off by working out that we would need a backend and frontend. We wrote down some questions and design decisions on post-it notes and created a rough architecture/design diagram. One of the team had experience with React and so offered to handle the frontend part. This left me and the other team member to create the backend.

As the functionality was on the simpler side, I suggested using FastAPI. It is my preferred technology for creating APIs as it is simple, integrates with Pydantic for validation and has a great testing framework. My backend partner had not used FastAPI before and preferred Flask, but it didn’t take me long to persuade them to give it a try!
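
To show why I rate that combination, here is a minimal sketch; the Review model and /reviews endpoint are hypothetical examples, not our hackathon code:

    # A hypothetical FastAPI endpoint whose request body is validated by a
    # Pydantic model, plus a test that exercises the validation.
    from fastapi import FastAPI
    from fastapi.testclient import TestClient
    from pydantic import BaseModel, Field

    app = FastAPI()

    class Review(BaseModel):
        movie_id: int
        rating: int = Field(ge=1, le=5)  # ratings outside 1-5 are rejected
        comment: str = ""

    @app.post("/reviews")
    def create_review(review: Review):
        return {"received": review.model_dump()}

    def test_rejects_out_of_range_rating():
        response = TestClient(app).post(
            "/reviews", json={"movie_id": 1, "rating": 9}
        )
        assert response.status_code == 422  # FastAPI's validation error status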

We continued much as you’d expect at a hackathon: we used TDD to put together the backend API and started integrating it with the UI. It was noticeably slower this time round compared to using the AI tools (especially without the auto-complete/in-line suggestions). This time, though, I personally felt I understood every line of code and was happy that it would pass a code review. I also spent next to no time reviewing the code, as we had actually written it ourselves.

An example of being slower was right at the start. We needed to create the FastAPI app, first of all just with a “Hello world” endpoint to make sure we had set it up right. Previously, I would have asked Co-pilot or ChatGPT to write a very brief boilerplate file for a FastAPI app. This time we had to google the FastAPI docs, navigate to the quick start guide and copy the code from there. As I had used this many times before I knew where to look, which sped things up somewhat. However, this process would have certainly been faster with the use of an AI tool.
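
For reference, the ‘Hello world’ boilerplate we copied from the docs is along these lines:

    # Roughly the first-steps example from the FastAPI documentation:
    # a single "Hello world" endpoint to confirm the app is wired up.
    from fastapi import FastAPI

    app = FastAPI()

    @app.get("/")
    async def root():
        return {"message": "Hello World"}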

By the end of the exercise we had a slightly crude web app with a UI and a backend. It had some basic filtering and sorting functionality but we did not have time to complete all of the requested features in the given time. It did have a full test suite though!

End of day discussion

This was the part of the day that was the most insightful to me. A pair from group B kicked off the ‘show and tell’ by showing their ‘StreamStack’ app. They had used Cursor and it was immediately very impressive. They had a complete application with all the functionality asked for, it looked nice, and they even had time to add bonus things like images. One of the members of the pair said something that really stuck with me, though. They explained that the application was practically a black box as they had only given it a few prompts and just asked it to create the application. After the AI had finished, they had tried to use images on a different page and were unable to get it working; this should have been trivial. They said, “This application was written two hours ago and I already feel like I’m working with legacy code.” They believed that if they had written it all themselves then adding these images would be trivial, but because it was a black box it would take them much longer to understand and make these changes.

I feel like my first pair had a similar problem with the AI code being a bit of a black box. This prompted me to ask the question, “There has been lots of talk about black box code and not easily understanding the AI changes – did anyone ask the AI to explain the code?” There was a long pause as it was clear no one had done this, including myself! It seems all groups had spent the day asking AI to write/change code and not once asked it to explain code. This is a feature that has been advertised, particularly with Co-pilot’s chat feature. I have used this a few times at work when moving into a new project. I think that was a largely unexplored area and a use that we should have tested more during the hackathon.

Another group spoke about abstraction and refactoring. They said that the AI tools heavily favour ‘copying and pasting’ similar code instead of extracting the logic into its own function for reuse elsewhere. They had similar functionality in three places in their app and the AI re-created the logic every time. If they wanted to tweak it they would have to change it in multiple places. It seems AI does not follow DRY (‘don’t repeat yourself’). They did explain that with some guidance and prompting the AI tools could refactor and extract logic, but it wasn’t natural and had to be requested specifically.
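
As a toy illustration of the pattern, here is the kind of duplication the tools tended to produce, followed by the extraction that had to be requested explicitly (the discount rule is hypothetical, not the group’s actual code):

    # Before: the same rule re-created in each place it is needed, so a
    # change means editing several spots.
    def cart_total(prices):
        total = sum(prices)
        return total * 0.9 if total > 100 else total  # duplicated discount rule

    def invoice_total(prices):
        total = sum(prices)
        return total * 0.9 if total > 100 else total  # same rule, copied again

    # After: the shared rule extracted once, so a change is made in one place.
    def apply_bulk_discount(total: float) -> float:
        return total * 0.9 if total > 100 else total

    def cart_total_extracted(prices):
        return apply_bulk_discount(sum(prices))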

A pair of developers followed on from this point. They asked the AI tools to refactor some code in a specific file; it managed this, but along the way it would update and change unrelated code in other files. Another person raised their hand and agreed with this point. They vented some frustration with this in their day job. They told us the following anecdote: they were working on a large codebase with many files and wanted to update/refactor a specific file. By default, Co-pilot will take your whole workspace as ‘context’ to make these changes. Unfortunately, that also means it can access and make changes to every file in your workspace. They suggested a good improvement to the tool would be to let you tell the AI to ‘read’ these files for context but only allow ‘write’ changes in files X, Y and Z.

Lastly, a member of my three for Exercise 2 said, “I have achieved a lot less in this problem compared with using AI tools; however, I can say for sure, I am more proud of the code I have written.” I think this is a key point because, as developers, all code we commit has our name on it. We should be proud of the code we write. This fosters ownership and, in my opinion, results in better code being written.

Post-event

After the event we headed to the pub. There was still a bit of chatting about AI, but we were mostly all done with discussing it for the day. It was nice to chat about other non-AI stuff over a beer. We all agreed we would love to attend a similar event in the future!

Limitations

If we were to do this again there are some things I would like to test. I think we gave the AI tools the best possible chance by picking problems that are widely solved with lots of examples on the internet. Having said that, there are some questions raised:

  • How well does it perform when writing code for something other than a web app e.g. embedded systems?
  • How well does it perform in a different problem domain?
    What about in a domain where there is lots of context required that may not be widely documented in the training data?
  • How well does it perform in an existing code base?
    Both of these exercises were building something new. How well does it work when asked to change/write new code in an existing project?
  • Would developers with more AI experience do better?
    Some of the developers had little experience with AI tools. Are there ways of working that unlock better output? Had we known these, would we have done better?
  • How good and useful is asking AI to summarise/explain code?
  • Most of these tools allow you to change the LLM being used. Would different LLM choices have produced better results?
  • As previously mentioned, does adding “Use TDD…” to the prompt actually use TDD within one prompt or does it require a two step process?
  • How safe is allowing the LLMs to directly run commands in the terminal?
    Is a ‘Continue’ button enough to prompt the user to verify the commands, compared with copying and pasting commands from the internet?

TLDR

My key takeaways are:

  • The AI tools are great for writing boilerplate/setup code.
  • AI tools tend not to follow DRY.
  • The AI tools did not write unit tests unless specifically asked, but when prompted they did write them to an acceptable standard.
  • The AI tools did better when asked to work in smaller steps.
  • Developers are more proud of their work when using less AI.
  • Some tools are better than others, with the tools that can edit directly in the IDE saving more time.
  • The ‘auto-complete’/in-line functionality is the way most developers use the AI tools.

Ultimately, it is clear to me: developers can already move faster and be more productive with AI tools and these effects are only increasing.

Isaac Oldwood is a Software Engineer working in the Insurance Industry. He taught himself Python at university to (unsuccessfully) purchase a pair of limited edition shoes. He organises Software Crafters Cambridge, a monthly tech meetup. In his spare time, he enjoys reading, rugby and running.

This article was first published on 2 May 2025 on Isaac’s blog: https://isaacoldwood.com/blog#codurance-ai-hackathon





