In an era where massive language models dominate headlines, a fascinating trend is emerging: highly specialized smaller models are proving surprisingly capable. But just how capable are they?

At Anaconda, we’re committed to helping data scientists and engineers find the right tools for each job. As part of our broader mission to empower data practitioners with reliable open source solutions, we’ve been exploring the performance boundaries of specialized models. This research directly informs our strategy for tools like Lumen AI, where optimizing AI capabilities across different model sizes and specialties is crucial.

A year ago, MotherDuck released DuckDB-NSQL 7B, an open-source 7-billion-parameter model focused solely on SQL generation. DuckDB itself has been gaining significant traction, finding its way into diverse applications, from data analysis pipelines to AI platforms like our own Lumen AI. The model is advertised as writing native DuckDB queries without the computational overhead of larger models.

This raises some intriguing questions: How does this specialized model stack up against much bigger models like GPT-4o? And given the rapid pace of AI advancement, where does it stand compared to newer 7B models like Qwen-2.5-Coder and the buzzworthy DeepSeek R1 Distill?

Explore the full code and project in Anaconda Notebooks: 

Experiment Setup

Dataset

To put these models to the test, we used the common penguins.csv dataset that's widely available online. Using Claude 3.5 Sonnet, we generated 20 question-and-query pairs about the dataset for evaluation.

Evaluators

When implementing the evaluations using Arize, it did not take long to hit the first road bump: there is no straightforward way to determine whether two SQL queries are equivalent. The complexity stems from several factors:

  • Syntactic Variation: The same logical query can be written in multiple syntactically different ways, from simple differences in alias names to completely different but logically equivalent query structures (see the sketch after this list).
  • Output Ambiguity: Different queries can produce seemingly similar results while having distinct logical meanings. Without explicit constraints on aspects like result ordering or row limits, comparing outputs becomes unreliable.
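
For instance, here is a minimal sketch (using DuckDB's Python API and the penguins.csv file from this experiment; the specific queries are illustrative, not taken from the evaluation set) of two textually different but logically equivalent queries:

import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE penguins AS SELECT * FROM read_csv_auto('penguins.csv')")

# Two syntactically different queries that ask the same question.
q1 = "SELECT species, COUNT(*) AS n FROM penguins GROUP BY species"
q2 = """
SELECT p.species AS species, COUNT(*) AS n
FROM penguins AS p
GROUP BY p.species
ORDER BY p.species
"""

# Naive string comparison says the queries differ...
print(q1.strip() == q2.strip())  # False

# ...but after normalizing row order they return identical results.
df1 = con.execute(q1).df().sort_values("species").reset_index(drop=True)
df2 = con.execute(q2).df().reset_index(drop=True)
print(df1.equals(df2))  # True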

This realization led to a more robust evaluation approach:

First, test questions were refined to be explicit and unambiguous. Instead of open-ended queries like “Show me penguin data by year,” we specified precise requirements such as “Which one year had the most penguin observations? Return both year and count.” This eliminates ambiguity about the expected output format and content.

Then, rather than attempting exact SQL string matching, we developed a set of metrics to validate query correctness. These proxies assess whether the generated query captures the intended analysis by comparing key characteristics of the results, like data boundaries, result dimensions, and execution performance (a simplified sketch follows the list below):

  • result: The raw output returned by the SQL query execution
  • query: The actual SQL query text that was generated and executed
  • error: Any error messages or exceptions encountered during query execution (None if successful)
  • elapsed: Time taken for the query to execute, measuring performance efficiency
  • expected_max: The maximum value expected in the result set, serving as a validation boundary
  • expected_min: The minimum value expected in the result set, providing a lower validation boundary
  • actual_max: The maximum value found in the actual query results for comparison
  • actual_min: The minimum value found in the actual query results for comparison
  • expected_shape: The expected dimensions (rows × columns) of the result DataFrame
  • actual_shape: The actual dimensions of the returned DataFrame, validating result structure
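
Below is a minimal sketch of how these metrics might be computed. The function names and structure here are hypothetical simplifications, not the exact notebook code:

import time

import duckdb
import pandas as pd

def run_task(con: duckdb.DuckDBPyConnection, query: str) -> dict:
    """Execute a generated query and capture the result, error, and timing."""
    start = time.perf_counter()
    try:
        result, error = con.execute(query).df(), None
    except Exception as exc:  # SQL errors surface as Python exceptions
        result, error = None, str(exc)
    return {"result": result, "query": query, "error": error,
            "elapsed": time.perf_counter() - start}

def evaluate(output: dict, expected: pd.DataFrame) -> dict:
    """Score an output against the expected DataFrame using the proxy metrics."""
    scores = {"no_error": output["error"] is None}
    actual = output["result"]
    if actual is not None:
        exp_num = expected.select_dtypes("number")
        act_num = actual.select_dtypes("number")
        scores["matching_shape"] = actual.shape == expected.shape
        scores["matching_max"] = act_num.max().max() == exp_num.max().max()
        scores["matching_min"] = act_num.min().min() == exp_num.min().min()
    return scores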

Prompt Format

With these evaluations set up, we are almost ready to run the experiments. One last step remains: choosing a prompt format. According to the Hugging Face repository, the DuckDB-NSQL model works best with the prompt format shown below because it was fine-tuned on it:

### Instruction:
Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.

### Input:
Here is the database schema that the SQL query will run on:

CREATE TABLE taxi (
    VendorID bigint,
    tpep_pickup_datetime timestamp,
    tpep_dropoff_datetime timestamp,
    passenger_count double,
    trip_distance double,
    fare_amount double,
    extra double,
    tip_amount double,
    tolls_amount double,
    improvement_surcharge double,
    total_amount double,
);

### Question:
get all columns ending with _amount from taxi table

### Response (use duckdb shorthand if possible):

This rigid prompt structure imposes some limitations, particularly around function calling capabilities, but if the performance is good, workarounds can be built to use it downstream.

The following Jinja2 template was used:

### Instruction: Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.

### Input: Here are the database schemas that the SQL query will run on:
{%- for table in table_schemas %}
CREATE TABLE {{ table.table_name }} (
{%- for column in table.columns %}
    {{ column.name }} {{ column.type }}
{%- endfor %}
);
{% endfor %}
### Question: {{ question }}

### Response (use duckdb shorthand if possible):

And here’s an example output:

### Instruction: Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.

### Input: Here are the database schemas that the SQL query will run on:

CREATE TABLE penguins_raw (
    species VARCHAR,
    island VARCHAR,
    bill_length_mm VARCHAR,
    bill_depth_mm VARCHAR,
    flipper_length_mm VARCHAR,
    body_mass_g VARCHAR,
    sex VARCHAR,
    year BIGINT
);

### Question: What is the average body mass (in g) for each species? Return both species name and average mass.

### Response (use duckdb shorthand if possible):
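
As a rough illustration, the example prompt above can be produced by rendering the template with Jinja2. The table_schemas structure below is an assumption inferred from the template's loop variables, not the exact code used in the experiment:

from jinja2 import Template

PROMPT_TEMPLATE = Template("""\
### Instruction: Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.

### Input: Here are the database schemas that the SQL query will run on:
{%- for table in table_schemas %}
CREATE TABLE {{ table.table_name }} (
{%- for column in table.columns %}
    {{ column.name }} {{ column.type }}
{%- endfor %}
);
{% endfor %}
### Question: {{ question }}

### Response (use duckdb shorthand if possible):
""")

# Assumed shape of table_schemas, inferred from the template variables.
table_schemas = [{
    "table_name": "penguins_raw",
    "columns": [
        {"name": "species", "type": "VARCHAR"},
        {"name": "body_mass_g", "type": "VARCHAR"},
        {"name": "year", "type": "BIGINT"},
    ],
}]

prompt = PROMPT_TEMPLATE.render(
    table_schemas=table_schemas,
    question="What is the average body mass (in g) for each species? "
             "Return both species name and average mass.",
)
print(prompt)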

Experiment Results

Penguins – DuckDB-NSQL 7B – No pre-processing, no-examples

The initial results were unfortunately unspectacular:

--------------------------------------------
| evaluator                |   n |   n_scores |   avg_score |
|:-------------------------|----:|-----------:|------------:|
| Evaluator.matching_max   |  20 |         18 |    0.222222 |
| Evaluator.matching_min   |  20 |         18 |    0.111111 |
| Evaluator.matching_shape |  20 |         18 |    0.5      |
| Evaluator.no_error       |  20 |         18 |    0.5      |
Out of 20 test questions, 18 produced scorable runs while 2 hit Python errors, but the bigger issue was how those 18 queries fared:

  • Only 22.2% matched the expected maximum values
  • Just 11.1% matched the expected minimum values
  • 50% produced results with the correct shape (dimensions)
  • 50% executed without any SQL syntax errors

A common error encountered was a type mismatch:

Binder Error: No function matches the given name and argument types 'avg(VARCHAR)'. You might need to add explicit type casts.
Candidate functions:
    avg(DECIMAL) -> DECIMAL
    avg(SMALLINT) -> DOUBLE
    avg(INTEGER) -> DOUBLE
    avg(BIGINT) -> DOUBLE
    avg(HUGEINT) -> DOUBLE
    avg(DOUBLE) -> DOUBLE

The error message shows an attempt to average string (VARCHAR) values.

Looking at our penguins dataset, we can see why:

species,island,bill_length_mm,bill_depth_mm,flipper_length_mm,body_mass_g,sex,year
Adelie,Torgersen,NA,NA,NA,NA,NA,2007

The presence of ‘NA’ strings rather than proper NULL values is causing DuckDB to interpret these numeric columns as text, leading to the type mismatch when attempting mathematical operations.
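
One way to pre-process the data is to tell DuckDB's CSV reader to treat the literal string 'NA' as NULL so numeric columns are inferred as numbers. This is a minimal sketch of that idea; the exact cleaning code used in the experiment lives in the notebook project:

import duckdb

con = duckdb.connect()

# nullstr tells read_csv_auto to interpret the string "NA" as NULL,
# so columns like body_mass_g are inferred as numeric instead of VARCHAR.
con.execute("""
    CREATE TABLE penguins AS
    SELECT * FROM read_csv_auto('penguins.csv', nullstr='NA')
""")

# avg() now binds correctly because body_mass_g is numeric.
print(con.execute("""
    SELECT species, AVG(body_mass_g) AS avg_body_mass_g
    FROM penguins
    GROUP BY species
""").df())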

Penguins – DuckDB-NSQL 7B – No pre-processing, yes-examples

In the next attempt, we iterated on our experiment based on our own observations and similar issues documented in the Arize docs:

> “There is a binder error, which may also have to do with not understanding the data format.”

Thus, we incorporated examples of the data into the prompt:

### Instruction: Your task is to generate valid duckdb SQL to answer the following question, given a duckdb database schema.

### Input: Here are the database schemas that the SQL query will run on:

CREATE TABLE penguins_raw (
    species VARCHAR /* Examples: Gentoo, Adelie, Chinstrap */,
    island VARCHAR /* Examples: Dream, Torgersen, Biscoe */,
    bill_length_mm VARCHAR /* Examples: 34.1, 46, 40.5 */,
    bill_depth_mm VARCHAR /* Examples: 19.3, 17.6, 21.2 */,
    flipper_length_mm VARCHAR /* Examples: 197, 200, 210 */,
    body_mass_g VARCHAR /* Examples: 3300, 3700, 3325 */,
    sex VARCHAR /* Examples: female, NA, male */,
    year BIGINT /* Examples: 2007, 2009, 2008 */
);

This performed a bit better, but still nowhere close to expectations: the model still wasn't capable of dropping the NA values.

--------------------------------------------
| evaluator                |   n |   n_scores |   avg_score |
|:-------------------------|----:|-----------:|------------:|
| Evaluator.matching_max   |  20 |         20 |        0.3  |
| Evaluator.matching_min   |  20 |         20 |        0.2  |
| Evaluator.matching_shape |  20 |         20 |        0.55 |
| Evaluator.no_error       |  20 |         20 |        0.55 |

Penguins – Qwen-2.5-Coder 7B – No pre-processing, yes-examples

What about more modern models of the same size, like Qwen-2.5-Coder? It performed a bit better, but not significantly.

--------------------------------------------
| evaluator                |   n |   n_scores |   avg_score |
|:-------------------------|----:|-----------:|------------:|
| Evaluator.matching_max   |  20 |         20 |        0.35 |
| Evaluator.matching_min   |  20 |         20 |        0.25 |
| Evaluator.matching_shape |  20 |         20 |        0.6  |
| Evaluator.no_error       |  20 |         20 |        0.6  |

Penguins – DuckDB-NSQL 7B – Pre-processing, yes-examples

After some pre-processing to remove the NAs, the experiment showed dramatically better results:

--------------------------------------------
| evaluator                |   n |   n_scores |   avg_score |
|:-------------------------|----:|-----------:|------------:|
| Evaluator.matching_max   |  20 |         19 |           1 |
| Evaluator.matching_min   |  20 |         19 |           1 |
| Evaluator.matching_shape |  20 |         19 |           1 |
| Evaluator.no_error       |  20 |         19 |           1 |

Spectacular! These results are extremely promising. Next, let's see whether this performance holds up on another dataset.

Baggage – DuckDB-NSQL 7B – Pre-processing, yes-examples

Testing with another dataset – airplane baggage – showed:

--------------------------------------------
| evaluator                |   n |   n_scores |   avg_score |
|:-------------------------|----:|-----------:|------------:|
| Evaluator.matching_max   |  20 |         20 |        0.75 |
| Evaluator.matching_min   |  20 |         20 |        0.85 |
| Evaluator.matching_shape |  20 |         20 |        0.95 |
| Evaluator.no_error       |  20 |         20 |        0.95 |

Not as good, but not bad! What were the slip-ups?

Question: How many mishandled baggage cases were reported by each carrier specifically in April 2019? Return carrier name and count.

Generated Query:

SELECT CARRIER_NAME, COUNT(*) FROM baggage WHERE MONTH = 4 AND YEAR = 2019 GROUP BY CARRIER_NAME;

Expected Query:

SELECT CARRIER_NAME, SUM(MISHANDLED_BAGGAGE) as mishandled_count FROM baggage WHERE YEAR = 2019 AND MONTH = 4 GROUP BY CARRIER_NAME

Max: 2 Expected Max: 82855.0 ❌

Min: 1 Expected Min: 925.0 ❌

Here, the queries are quite similar, but the mismatch occurs because the model interpreted “how many cases” as “how many records” rather than “the sum of mishandled baggage cases.” Perhaps this question wasn't phrased as explicitly and clearly as we had anticipated. A few other failures follow this pattern.
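
To make the distinction concrete, here is a toy sketch with hypothetical rows shaped like the baggage table (each row is already a monthly aggregate per carrier), showing how the two aggregations diverge:

import duckdb

con = duckdb.connect()

# Hypothetical monthly-aggregate rows, roughly the shape of the baggage table.
con.execute("""
    CREATE TABLE baggage AS
    SELECT * FROM (VALUES
        ('Carrier A', 2019, 4, 82855),
        ('Carrier B', 2019, 4, 925)
    ) AS t(CARRIER_NAME, YEAR, MONTH, MISHANDLED_BAGGAGE)
""")

# Counting records: there is one row per carrier-month, so every carrier gets 1.
print(con.execute("""
    SELECT CARRIER_NAME, COUNT(*) FROM baggage
    WHERE MONTH = 4 AND YEAR = 2019 GROUP BY CARRIER_NAME
""").df())

# Summing the cases: returns the actual mishandled-baggage totals per carrier.
print(con.execute("""
    SELECT CARRIER_NAME, SUM(MISHANDLED_BAGGAGE) AS mishandled_count FROM baggage
    WHERE YEAR = 2019 AND MONTH = 4 GROUP BY CARRIER_NAME
""").df())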

Baggage – Qwen-2.5-Coder – Pre-processing, yes-examples

Running with Qwen showed even better results:

--------------------------------------------
| evaluator                |   n |   n_scores |   avg_score |
|:-------------------------|----:|-----------:|------------:|
| Evaluator.matching_max   |  20 |         20 |         0.9 |
| Evaluator.matching_min   |  20 |         20 |         1   |
| Evaluator.matching_shape |  20 |         20 |         1   |
| Evaluator.no_error       |  20 |         20 |         1   |

Tasks Summary (01/30/25 05:43 PM -0800)
---------------------------------------
|   n_examples |   n_runs |   n_errors |
|-------------:|---------:|-----------:|
|           20 |       20 |          0 |

Baggage & Flights – DuckDB-NSQL 7B and Qwen-2.5-Coder – Yes-examples

After seeing the impressive performance on single-table operations, we were eager to test these models' capabilities in more complex scenarios involving table joins, because what is SQL without table joins?

Both models were tested with five more challenging join scenarios:

  1. Calculating mishandled baggage rates per 1000 flights
  2. Computing annual passenger and flight totals
  3. Analyzing international to domestic flight ratios
  4. Determining average enplaned baggage per flight
  5. Finding carriers with the largest decrease in mishandled baggage rates

The results were disappointing, with both models scoring:

Experiment Summary (02/10/25 11:19 AM -0800)
--------------------------------------------
| evaluator                |   n |   n_scores |   avg_score |
|:-------------------------|----:|-----------:|------------:|
| Evaluator.matching_max   |   5 |          5 |         0   |
| Evaluator.matching_min   |   5 |          5 |         0.2 |
| Evaluator.matching_shape |   5 |          5 |         0.4 |
| Evaluator.no_error       |   5 |          5 |         0.4 |

Tasks Summary (02/10/25 11:19 AM -0800)
---------------------------------------
|   n_examples |   n_runs |   n_errors |
|-------------:|---------:|-----------:|
|            5 |        5 |          0 |

The most interesting findings came from analyzing the failure patterns. Both models struggled with:

  1. Column Hallucination: The models frequently invented columns that didn’t exist in the schema. For example, DuckDB-NSQL tried to join on a non-existent “FlightDate” column instead of using the actual “Year” column. Similarly, attempts to reference columns like “FlightNumber” and “AIRCRAFT_ID” appeared in queries despite these columns not being present in either table.
  2. Join Condition Completeness: The models often missed critical join conditions. For example, when joining flights and baggage tables, they frequently forgot to include the month in the join condition alongside the year, leading to incorrect result multiplication.
  3. Column Ownership: Both models struggled with column ambiguity. A common error was trying to reference columns like “International” or “Domestic” directly from the baggage table when they actually belonged to the flights table.
  4. Aggregate Function Usage: The models often chose incorrect aggregation approaches, such as using AVG instead of SUM/COUNT combinations, or forgetting to handle NULL values with NULLIF.

Example of a problematic query:

Generated Query

SELECT Year, (SUM(MISHANDLED_BAGGAGE) * 1000) / SUM(TOTAL) as mishandled_per_1000_flights FROM flights JOIN baggage ON flights.FlightDate = baggage.YEAR GROUP BY Year ORDER BY mishandled_per_1000_flights DESC;

Expected Query

SELECT b.YEAR, (SUM(b.MISHANDLED_BAGGAGE) * 1000.0) / SUM(f.TOTAL) as mishandled_per_1000_flights FROM baggage b JOIN flights f ON b.YEAR = f.Year AND b.MONTH = f.Month GROUP BY b.YEAR ORDER BY mishandled_per_1000_flights DESC LIMIT 1

Baggage & Flights – DeepSeek-R1-Distill-Qwen-7B – Yes-examples

Since joins are more complicated, this seemed like a suitable job for a model that can reason, like DeepSeek R1 Distill… right?

Tackling this question:

“Find the year with the highest ratio of mishandled baggage per 1000 flights, where the ratio is calculated as (total mishandled baggage × 1000) / total flights for each year. Return two columns: YEAR and mishandled_per_1000_flights.”

The output from DeepSeek (the small Distill version) included an impressive thinking process:

"<think>\nAlright, let's tackle this problem step by step. I need to write a SQL query that calculates the ratio of mishandled baggage per 1000 flights for each year from the given database schemas and then returns the top year with the highest ratio.\n\nFirst, I'll look at the tables involved: the `flights` table and the `baggage` table. The key here is to relate these two tables since we need to connect baggage mishandled counts with flight counts by year.\n\nIn the `flights` table, there's a `Year` column that seems similar to the one in the `baggage` table. So, I'll join these two tables on the `Year` column to get matched data for each year.\n\nNext, since we're calculating per 1000 flights, I need to ensure both flight counts and baggage mishandled counts are aggregated by year before performing any calculations. That means I should use aggregate functions like `SUM()` with `GROUP BY` on the `Year`.\n\nOnce I have the total flights and total mishandled baggage for each year, calculating the ratio is straightforward: `(total_mishandled_bags * 1000) / total_flights`. But I need to handle division by zero just in case there's a year with no flights. Maybe adding an `IF` statement or checking if `total_flights` is greater than zero.\n\nAfter calculating the ratio, I'll sort the results in descending order based on the `mishandled_per_1000_flights` column to find the highest ratio.\n\nPutting it all together, my SQL query will join the two tables, aggregate by year, compute the ratio, and then select only the top result. This should give me the desired output with YEAR and Mishandled per 1000 Flights.\n</think>\n\nTo solve this problem using DuckDB shorthand, we need to:\n\n1. **Join** the `flights` table with the `baggage` table by year to calculate the ratio of mishandled baggage per 1000 flights.\n\n2. **Aggregate** the data by year to get total flights and total mishandled baggage counts for each year.\n\n3. **Compute** the ratio as (total_mishandled_bags * 1000) / total_flights, but only if there are flights (`total_flights > 0`).\n\n4. **Sort** the results in descending order based on `mishandled_per_1000_flights`.\n\n5. **Select** the top result.\n\nHere is the DuckDB shorthand query to achieve this:\n\n```sql\nSELECT \n    y.YEAR AS YEAR,\n    (f.bags * 1000.0 / f.flights) AS mishandled_per_1000_flights\nFROM \n    flights f\nJOIN \n    baggage b ON f.YEAR = b.YEAR\nGROUP BY \n    YEAR\nORDER BY \n    mishandled_per_1000_flights DESC\nLIMIT 1;\n```\n\nThis query:\n- Joins `flights` (f) and `baggage` (b) by year.\n- Groups the results by year.\n- Orders them in descending order based on the computed ratio of mishandled baggage per 1000 flights.\n- Limits the result to the top entry.\n\nIf you need all entries instead of just the highest, remove the `LIMIT 1`.\n\nThis solution assumes that:\n- The columns from both tables have matching values by year (e.g., similar date formats or actual years).\n- All rows are numeric as per the schema definitions.

Generated Query:

SELECT y.YEAR AS YEAR, (f.bags * 1000.0 / f.flights) AS mishandled_per_1000_flights FROM flights f JOIN baggage b ON f.YEAR = b.YEAR GROUP BY YEAR ORDER BY mishandled_per_1000_flights DESC LIMIT 1;

However, it still faced the same problem:

Binder Error: Ambiguous reference to column name “YEAR” (use: “b.YEAR” or “f.YEAR”)

Expected Query:

SELECT b.YEAR, (SUM(b.MISHANDLED_BAGGAGE) * 1000.0) / SUM(f.TOTAL) as mishandled_per_1000_flights FROM baggage b JOIN flights f ON b.YEAR = f.Year AND b.MONTH = f.Month GROUP BY b.YEAR ORDER BY mishandled_per_1000_flights DESC LIMIT 1

This generated query had several issues:

  1. Ambiguous YEAR reference: The query didn’t specify which table’s YEAR column to use (b.YEAR or f.YEAR)
  2. Incorrect column names:
    • Used f.bags instead of b.MISHANDLED_BAGGAGE
    • Used f.flights instead of f.TOTAL
  3. Missing grouping by MONTH: Since we’re joining baggage and flights tables that have monthly data, we need to match both YEAR and MONTH to get accurate totals
  4. Missing aggregation: The query needed to SUM the values before calculating the ratio

In the end, it not only performed worse, but also took much longer to finish!

--------------------------------------------
| evaluator                |   n |   n_scores |   avg_score |
|:-------------------------|----:|-----------:|------------:|
| Evaluator.matching_max   |   5 |          4 |        0    |
| Evaluator.matching_min   |   5 |          4 |        0    |
| Evaluator.matching_shape |   5 |          4 |        0    |
| Evaluator.no_error       |   5 |          4 |        0.25 |

Baggage & Flights – OpenAI gpt-4o – Yes-examples

These were all open-source 7B models. How does a much, much larger model, like OpenAI's GPT-4o, compare?

--------------------------------------------
| evaluator                |   n |   n_scores |   avg_score |
|:-------------------------|----:|-----------:|------------:|
| Evaluator.matching_max   |   5 |          5 |         0.8 |
| Evaluator.matching_min   |   5 |          5 |         0.6 |
| Evaluator.matching_shape |   5 |          5 |         1   |
| Evaluator.no_error       |   5 |          5 |         1   |

Amazing, but not perfect! While the 7B models often hallucinated non-existent columns or made fundamental join mistakes, GPT-4o's errors were more nuanced:

  1. Aggregation Logic: In the mishandled baggage rate calculation, GPT-4o initially used passenger counts instead of flight totals as the denominator. This is a logically defensible choice (one could argue that per-passenger rates are meaningful), but it wasn't what was requested.
  2. Business Logic Understanding: For rate calculations between years, GPT-4o sometimes missed the need to ensure consistent time periods across the joins. This manifested in queries that produced mathematically valid but analytically incorrect results.
  3. “Creative” Problem Solving: In several cases, GPT-4o added sophisticated CTE (Common Table Expression) structures that, while technically impressive, actually overcomplicated the solution. For example, it used passenger-based normalization when simple flight counts would have sufficed.

Conclusions

This experiment explored whether specialized, fine-tuned models could compete with AI giants in specific domains. While the results show they aren't quite there yet for complex operations, the findings are encouraging. With proper data preprocessing, DuckDB-NSQL achieved perfect scores on single-table operations, demonstrating remarkable capability when properly matched to the task.

Though join operations remain challenging for 7B models, their rapid improvement suggests this gap will continue to narrow. At Anaconda, these insights guide how AI-powered tools like Lumen AI approach model selection: using specialized models for straightforward queries where they excel, while reserving larger models for complex operations.

Anaconda users can conduct their own comparative experiments through AI Navigator, finding the optimal balance between performance, cost, and accuracy. As specialized models continue to evolve, Anaconda remains committed to providing access to the full spectrum of options, empowering data scientists to select the right tool for each specific job.

Takeaways:

  • Both specialized and general-purpose 7B models struggled significantly with joins
  • Preprocessing the data was crucial: after cleaning, DuckDB-NSQL achieved near-100% accuracy on single-table operations
  • Providing example data dramatically improved model performance across all tasks
  • Clear, explicit query specifications led to a better evaluation experience
  • LLMs proved valuable partners in designing the experimental framework itself

Caveats:

  • The prompt format was optimized specifically for DuckDB-NSQL, potentially disadvantaging other models
  • One-shot evaluation doesn’t reflect real-world usage where developers iterate and refine queries
  • Given the non-deterministic nature of LLMs, multiple runs would provide more reliable results
  • An LLM-based evaluation system could provide more nuanced analysis of query correctness
  • More sophisticated semantic equivalence testing could better handle syntactically different but logically equivalent queries
  • The evaluation dataset is extremely limited in size

These caveats represent limitations in our experimental design, but we invite you to help explore them!