A side-effect of all the time I spend breathing the rarefied alpine air of the CDO community, exchanging pithy apophthegms about the future of the data profession, is that my SQL skills have become rather rusty (in truth, they’ve always been a bit shaky, but my Excel Pivot Table skills are on point, I’ll have you know). So I’ve been intrigued by the idea of using the code-generation capabilities of tools like ChatGPT and Bard to write SQL for me. Turns out I’m not the only person excited about this: already, dozens of tools have sprung up claiming to unlock the power of SQL for non-technical users through the magic of AI. But how good is the current crop of LLMs at creating SQL code that not only works, but generates the insight you’re actually looking for? I decided to find out.
Here’s how SQL code generation works with an LLM: You use your initial prompt to provide information about the schema of the database you’re working with – i.e. information about the tables, columns and datatypes. So long as this information isn’t too large for the LLM’s context window, you can then follow up with prompts asking the LLM to generate the SQL to answer specific questions from the data, and then use this SQL in your favourite query tool to get the actual answer.
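To make that concrete, here’s a minimal sketch of the first step – packing the schema and instructions into an initial prompt. The table definitions and the helper function are hypothetical, and the actual API call to the LLM is left out:

```python
# Sketch of the text-to-SQL flow: pack the schema into the first prompt,
# then follow up with natural-language questions. Table/column names
# below are illustrative, not the real Northwinds schema.
schema = """
Table: orders (order_id INT, customer_id INT, order_date DATE, freight DECIMAL)
Table: order_details (order_id INT, product_id INT, quantity INT,
                      unit_price DECIMAL, discount DECIMAL)
"""

def build_initial_prompt(schema: str, dialect: str = "MS SQL Server") -> str:
    """Combine the schema and instructions into the LLM's first message."""
    return (
        f"Here is a schema for a set of database tables hosted on {dialect}. "
        "I am going to ask you a series of questions about this data; "
        "respond with the appropriate SQL query.\n" + schema
    )

prompt = build_initial_prompt(schema)
# This prompt (plus each follow-up question) is what gets sent to the LLM's
# chat API; the returned SQL is then pasted into a query tool and run there.
```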
All of today’s text-to-SQL tools use some variant of this approach: they integrate with the LLMs via their APIs and provide a bit of UI window-dressing to capture the schema information (and the database platform, which determines the SQL dialect to use), but ultimately they just bundle this up and pass it to the LLM (and in some cases offer to execute the resulting query directly). AI Helper Bot is one example, which you can try out without having to sign up with a credit card.
So instead of trying out all the new specialised tools, I went directly to the source, using the public chat interfaces for the three major LLM families in circulation: ChatGPT (v3.5 and 4), Anthropic’s Claude, and Google’s Bard. Here’s how the rankings came out:
- First place: GPT-4
- Second place (tied): GPT-3.5 and Claude
- Fourth place (distantly): Bard
What I discovered was this: GPT and Claude are pretty good at generating working SQL code, and I was able to nudge them (through a series of follow-up prompts) to get to the answer I needed. Bard was a lot less effective, to the point of being almost useless. But in almost no cases beyond very simple questions did I get the “right” answer on my first try, and I had to examine the generated SQL code quite closely to check that it was doing what it needed to do.
In one of the tests, all of the LLMs generated code that gave an answer that was close to correct, but contained a crucial mathematical error – the kind of error that could have made it into a dashboard and been reported for months or years before it was realised that it was wrong. Had I not been able to read SQL myself I would not have spotted this error, which has mildly terrifying implications for non-coders who aspire to use these tools to build reports.
Another key learning was that the specific language used in the prompts made quite a big difference to the generated code – a carelessly worded prompt could easily generate the wrong result. This also has important implications for the likely usage of text-to-SQL tools: the promise of these tools is that users can write down what they’d like to see and instantly get the answer they need, but this is very much not the case today.
This shouldn’t come as a surprise, since when analysts sit down to write SQL to answer a question, sometimes even they don’t know exactly what they’re aiming for, not least because often, their stakeholders have not clearly articulated what they want to know.
So no, these LLMs cannot magically generate perfect SQL code that delivers exactly what you need first time. Ironically enough, though, these tools are quite well-suited to someone like me, who is prepared to iterate the prompt to work through the (numerous) syntactic errors made by the LLMs, and can parse the resulting code to ensure it’s doing the thing it’s meant to (though my own ability to do this would run out of steam pretty quickly). But anyone without a reasonable working knowledge of SQL should use these tools with extreme caution.
Interacting with the LLMs for this test felt like working with a wildly over-confident, but actually pretty rubbish, colleague, who confidently produces code that doesn’t work, even adding a helpful explanation of their non-working code. When the errors in the code are pointed out, the LLMs would respond with “Ah yes! You’re quite right, that won’t work” before often generating another block of code that still wasn’t right.
The rest of this post goes through the tests I ran in more detail.
About the data
The data I used to conduct this bake-off is a sample dataset from Maven Analytics for a fictional company called Northwinds Trading. The data is a relatively simple set of normalized tables containing customer details, orders, and order details, as well as some supporting information about product categories, salespeople and freight companies. From an analytics perspective, there are a couple of nice features of the data which provided some interesting tests of the capabilities of the LLMs I used for the SQL generation:
- The orders table contains information about the customer and freight costs, but not the actual order value
- The order details table contains a breakout of the items and their amounts and prices, plus a “discount” column that represents the percentage discount off the list price for the order
- The dataset comes with a very helpful data dictionary / schema which I provided to each LLM as an initial prompt (here’s an extract of the first few lines of the schema doc):
My initial prompt to the LLMs (along with the schema info) was as follows:
Here is a schema for a set of database tables hosted on MS SQL Server that describe sales and customers for a company, Northwinds Trading. I am going to ask you a series of questions about this data; you should respond with the appropriate SQL query to generate the answer.
I then proceeded to write a series of prompts and copy/paste the results into SQL Server Management Studio to run against an Azure-hosted SQL Server that had the tables uploaded.
For each LLM I counted the number of prompts/follow-up prompts it took to get to a query that ran and generated a valid answer (for all of the prompts there was more than one valid answer; more on that below).
So how did the LLMs get on?
Prompt 1: What are the top 5 selling products?
I started off with an intentionally simple (though still slightly ambiguous) prompt. Here’s how they did:
All the successful LLMs interpreted the prompt as looking for the names of the top 5 selling products ranked by units sold (rather than revenue), with both the GPT models and Claude creating SQL which joined the order details and product details tables to return the friendly names of the products. Here’s the code that GPT-3.5 generated, and the resulting data:
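In rough outline, the successful queries joined the order details to the products table, summed the quantities, and took the top five. Here’s a runnable sketch of that shape – SQLite via Python, with made-up data standing in for the real tables; on MS SQL Server the `LIMIT 5` would be `SELECT TOP 5` instead:

```python
import sqlite3

# Sketch of the "top 5 selling products by units sold" query shape.
# Data below is illustrative, not the Northwinds dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE products (product_id INTEGER, product_name TEXT);
CREATE TABLE order_details (order_id INTEGER, product_id INTEGER,
                            quantity INTEGER, unit_price REAL, discount REAL);
INSERT INTO products VALUES (1, 'Chai'), (2, 'Chang'), (3, 'Aniseed Syrup');
INSERT INTO order_details VALUES
  (10, 1, 40, 18.0, 0.0), (11, 2, 10, 19.0, 0.1),
  (12, 1, 25, 18.0, 0.0), (13, 3, 5, 10.0, 0.0);
""")

top_products = conn.execute("""
    SELECT p.product_id, p.product_name, SUM(od.quantity) AS units_sold
    FROM order_details AS od
    JOIN products AS p ON p.product_id = od.product_id
    GROUP BY p.product_id, p.product_name
    ORDER BY units_sold DESC
    LIMIT 5
""").fetchall()

print(top_products)  # 'Chai' ranks first with 65 units in this toy data
```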
GPT 3.5 gets a bonus point for including the product ID as well as the name, but otherwise the code generated was very similar across the three. GPT-4 gets a point for probably the best explanation of the code it generated alongside the code (it was consistently the best at describing the code it generated). Which brings us to Bard.
Bard, alas, did not generate any functioning SQL code for this task – it kept including commands and syntax that did not work on MS SQL Server, and when I pointed this out, it said “Sorry! You’re right, that won’t work on SQL Server” before providing another variation on the code that still didn’t work, eventually ending up in a loop, at which point I abandoned the process. It’s a shame because, of the four LLMs, it was the one that tried to include a bit more information in the query, for example the total sales amount for the products, alongside the sales volume. But it could not join in the correct table fields or get its head around T-SQL syntax.
Bard did include sample output in its response, which was vaguely helpful but also confusing (because it was making up data, as it didn’t have access to the actual data).
I shan’t keep kicking Bard while it’s down in this assessment (as you may imagine, it didn’t do any better on the following more complex prompts) but I shall call out a couple more interesting things that it did as we go through the more challenging prompts.
Prompt 2: Calculate the average percentage cost of shipping as a percentage of total order value
The intent of this question was to find out what share of order values was taken up by shipping costs. There are a couple of subtleties to answering this question correctly: the first is that the total order value needs to be calculated from the order details table, taking the discount column into account. The second is that the final percentage can be calculated in either of the two following ways:
- Total Freight Costs / Total Order Value (answer: 5.13%)
- Total Freight Costs / (Total Order Value + Total Freight Costs) (answer: 4.88%)
The existence of these two potential answers illustrates the point I made in the introduction: Until I started to examine the output from the LLMs and think about the actual calculation being used, I hadn’t really thought about which of the two answers I really wanted. For the purposes of the test, I deemed a query that generated either one of the two answers above a successful query.
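To see the difference, here’s a small worked example. The order lines and freight figure below are invented, chosen so that the two methods land on the same two percentages quoted above; note how the discount has to be applied when computing the order value – a detail the models routinely missed:

```python
# Toy order lines: (quantity, unit_price, discount).
lines = [(10, 20.0, 0.0), (5, 40.0, 0.25), (2, 100.0, 0.1)]

# Order value must apply the discount: 200 + 150 + 180 = 530.0
total_order_value = sum(q * p * (1 - d) for q, p, d in lines)
total_freight = 27.189  # hypothetical freight total

# Method 1: freight as a share of goods value only
pct_excl = total_freight / total_order_value                    # 5.13%
# Method 2: freight as a share of the whole invoice
pct_incl = total_freight / (total_order_value + total_freight)  # 4.88%
```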
This prompt was a lot harder for the LLMs to answer. Here’s how long each one took to get to one of the above two answers:
Getting to one of the right answers was a real process with a couple of the LLMs. GPT 3.5 started out with an overly simplistic approach which joined orders to order values and double-counted (or more) the freight costs. It fixed this when it was pointed out, and then had to be guided to include the discount appropriately, and to use the right maths for the overall average. This latter problem was a common failing for all the models – their first attempt always used the simple arithmetic mean of a table of percentages, which is of course not right. Unfortunately, just as GPT 3.5 got close to the right answer, a follow-up prompt threw it completely off again – it forgot a lot of the earlier corrections and ended up going around in circles, so I gave up after 13 prompts.
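The arithmetic-mean mistake is worth a tiny illustration, because it is so easy to make (the figures below are invented): averaging per-order percentages lets small orders with proportionally large freight costs dominate, whereas the correct overall figure weights by order value.

```python
# (order_value, freight_cost) -- illustrative figures only
orders = [
    (1000.0, 20.0),  # big order, 2% freight
    (50.0, 10.0),    # tiny order, 20% freight
]

# Naive approach the models kept producing: mean of per-order percentages
simple_mean = sum(f / v for v, f in orders) / len(orders)  # 11% -- wrong

# Correct approach: total freight over total order value
weighted = sum(f for _, f in orders) / sum(v for v, _ in orders)  # ~2.9%
```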
GPT 4 was a lot better. It generated a nice piece of SQL code out of the gate which used a common table expression to calculate the per-order costs. But it did make the arithmetic-mean mistake. When this was pointed out, it generated a working piece of SQL on the second attempt. It is possible that the additional use of the word “average” in the prompt (before “percentage cost”) was responsible for this: without this, GPT-4 got the answer right first time.
Claude fell somewhere between the two – it needed a lot more help (its first couple of tries just failed on basic syntax) but it got there in the end; it had to be specifically nudged about the discount and the arithmetic mean issue.
Sadly, Bard got nowhere near an answer on this task. It started out by completely hallucinating two columns that it needed from the orders table, and got stuck in a loop futzing with these names, even though it acknowledged that they didn’t exist in the table. After only 4 tries it became clear that the conversation wasn’t going anywhere.
Prompt 3: Which customers spent the most each month?
The intended output of this prompt is a list of year/months, and the name of the customer that spent the most in that month, together with the amount they spent. As with the previous prompt, there are a couple of legitimate ways of calculating amount spent, either excluding or including freight costs (either version counted as a right answer in this test). A fully correct answer included the name, and not just the ID, of each top-spending customer. The following table shows a few lines of the intended output:
Here’s how the LLMs got on:
Both versions of GPT fared pretty well with this test. GPT 3.5 generated a couple of rounds of code that attempted to use a complex HAVING clause with joins to extract the months and years, but this generated errors; after a couple of rounds of this, it changed tack and created a CTE-based query that worked first time. GPT-4 took the CTE route out of the gate, but missed out an aggregation on a column in the CTE the first time; once this was pointed out, it generated code that executed very well.
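A CTE-based query for this problem looks roughly like the sketch below – a CTE to compute monthly spend per customer, then a window function to pick each month’s top spender. The exact ranking method the models used is my assumption, the data is made up, and this is SQLite syntax via Python; the real runs used T-SQL, where the date handling (e.g. `YEAR()`/`MONTH()` rather than `strftime`) differs:

```python
import sqlite3

# Illustrative data only -- not the Northwinds dataset.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER, company_name TEXT);
CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, order_date TEXT);
CREATE TABLE order_details (order_id INTEGER, quantity INTEGER,
                            unit_price REAL, discount REAL);
INSERT INTO customers VALUES (1, 'Alfreds'), (2, 'Berglunds');
INSERT INTO orders VALUES (10, 1, '2021-01-05'), (11, 2, '2021-01-20'),
                          (12, 2, '2021-02-03');
INSERT INTO order_details VALUES (10, 3, 100.0, 0.0), (11, 10, 50.0, 0.2),
                                 (12, 1, 80.0, 0.0);
""")

rows = conn.execute("""
    WITH monthly_spend AS (
        -- discounted spend per customer per year/month
        SELECT strftime('%Y-%m', o.order_date) AS ym,
               c.company_name,
               SUM(od.quantity * od.unit_price * (1 - od.discount)) AS spend
        FROM orders o
        JOIN customers c ON c.customer_id = o.customer_id
        JOIN order_details od ON od.order_id = o.order_id
        GROUP BY ym, c.company_name
    ),
    ranked AS (
        SELECT *, ROW_NUMBER() OVER (PARTITION BY ym ORDER BY spend DESC) AS rn
        FROM monthly_spend
    )
    SELECT ym, company_name, spend FROM ranked WHERE rn = 1 ORDER BY ym
""").fetchall()

print(rows)  # one (year-month, customer name, spend) row per month
```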
Claude took a bit longer to get to the answer. Its first attempt failed because it used a function that isn’t supported on MS SQL Server; on its next attempt it generated a list of spend by month/year, but didn’t filter to the top spending customers only. Once this was pointed out, it took another couple of tries to get to the right list, but only included customer ID and not name. A prompt to add the name of the customers generated the desired output on the fifth try.
Bard actually came somewhat close to generating a right answer on this task, but it needed plenty of guidance. Its first attempt failed because it bizarrely ignored the information in the provided schema and referred to the order_details table as orderDetails. When the error was provided back to it as a second prompt, it pointlessly provided code to create the orderDetails table. The third prompt pointed this out more explicitly, saying that the table was called order_details, and with that information, Bard produced a working query that grouped customer spend by month (but did not select only the top-spending customers, or provide their names). Further prompts to try to get it closer to the desired output generated further code/table hallucination and ended up going round in circles, so I gave up after 9 prompts.
As I mentioned at the top of this post, although the capabilities of these LLMs in generating SQL code are objectively impressive (it’s easy to forget how magical this stuff is), it’s not clear when or whether they will get good enough to be able to generate perfect code first time that actually answers the question you have. Even if you know the question you need the answer to, phrasing this question in a specific-enough way to get the right output will be a matter of trial and error. For relatively simple questions, this might be fine; but the more complex the question, the more convoluted the prompt needed. At some point, it might just be easier to learn SQL than spend hours fine-tuning a prompt to get the same result (and chewing up a bunch of compute cost in the process). Time will tell.