← All projects

2026 · Fine-tuning · Text-to-SQL

SQLForge: Mistral 7B QLoRA

QLoRA fine-tune of Mistral 7B v0.3 on text-to-SQL. Exact set match goes from 9 percent to 87 percent on the internal test split, trained on a single RTX 3080 Ti.

Released 2026

Why I built this

After pretraining Phoenix 125M from scratch, the natural next question was: can the same hardware do something practically useful? Text-to-SQL was the right target. The base Mistral 7B knows SQL syntax from The Stack pretraining but is unreliable on real schemas. QLoRA on 178K examples was small enough to fit, large enough to matter, and the eval signal is concrete (does the query return the right rows). The schema-aware eval rebuild was the most important part. The first WikiSQL run reported 0 percent valid SQL because the validator was checking string equality instead of execution. Fixing that turned a panic moment into a 97.4 percent result.

+77.8 pp Exact match lift over base
8.25 GB Peak VRAM on 12 GB card
97.4% Valid SQL (schema-aware)

Architecture

Model selection

  • Mistral 7B v0.3 · Apache 2.0, no gating
  • ~8.25 GB peak VRAM · 3.75 GB headroom on 12 GB
  • Code-heavy pretraining · The Stack v1 + GitHub
  • Trivial chat template · clean loss masking on answers
  • vLLM ready · no extra flags for serving
  • Largest delta · capable base → big eval lift

Beat Llama 3.1 8B (gated, 10.5 GB), Phi-3.5-mini (under-utilises GPU), Qwen2.5 7B

VRAM budget on 12 GB

  1. 3.60

    Model weights

    4-bit NF4 quantisation with double quantisation. ~22 GB BF16 base compressed into 3.6 GB.

  2. 0.03

    LoRA adapters

    Rank 64 across 7 target modules (q, k, v, o, gate, up, down) in BF16. Trainable parameter count is tiny.

  3. 0.12

    AdamW optimiser state

    FP32 momentum and variance for adapter weights only. The frozen base model contributes nothing here.

  4. 3.00

    Activations

    Batch size 1, sequence length 512, gradient checkpointing on. Recompute trades compute for memory.

  5. 1.50

    CUDA kernels and fragmentation

    Overhead for cuBLAS, bitsandbytes kernels, and allocator fragmentation. Buffered to stay safe.

  6. 8.25

    Total estimate

    3.75 GB headroom on a 12 GB RTX 3080 Ti. Allows future sequence-length and rank increases without OOM.

Data and recipe

  • Primary · b-mc2/sql-create-context (78,577 examples, Apache 2.0)
  • Augmentation · gretelai/synthetic_text_to_sql (100K, Apache 2.0)
  • Quantisation · 4-bit NF4 with double quant via bitsandbytes
  • LoRA · rank 64, alpha 16, dropout 0.05, 7 target modules
  • Loss masking · DataCollatorForCompletionOnlyLM on the [/INST] split
  • Schedule · cosine LR with warmup, gradient checkpointing on

Evaluation harness

Three benchmarks: a 500-example internal test split (exact set match plus schema-aware valid SQL), WikiSQL test set with true execution accuracy against table rows, and Spider 1.0 dev for community comparability. The first WikiSQL run reported 0 percent valid SQL because the validator was checking string equality; rewriting it to actually run the queries against table schemas turned the metric honest and revealed a real 97.4 percent valid rate.

Serving

Adapter merged back into the base weights and re-exported in safetensors format. Loads cleanly into vLLM with no custom code, ready for batched throughput serving. Apache 2.0 license, published on HuggingFace alongside the evaluation results.

Tech stack

Technologies used

core

PyTorch 2.xHuggingFace TransformersPEFT (QLoRA)bitsandbytes (4-bit NF4)TRL (SFTTrainer)

data

b-mc2/sql-create-contextgretelai/synthetic_text_to_sqlWikiSQLSpider 1.0 dev

tools

vLLM (serving)sqlglot (schema-aware validation)NAS via SMB (checkpoints)HuggingFace Hub

Key highlights

Proof points

  1. 01

    Exact set match jumped from 9.2 percent (base Mistral 7B v0.3) to 87.0 percent on the 500-example internal test split. A 77.8 percentage point lift from a single fine-tune run.

  2. 02

    Schema-aware valid SQL hit 97.4 percent. The original string-match validator reported 0 percent; rewriting it to execute queries against table schemas turned a panic into a real result.

  3. 03

    Trained on a single RTX 3080 Ti with ~8.25 GB peak VRAM. The 3.75 GB headroom budgeted up front meant zero OOM during a 1,173-second eval run on 500 examples.

  4. 04

    WikiSQL execution accuracy of 28.19 percent on 15,878 test examples, evaluated by actually running the predicted SQL against the WikiSQL table rows.

  5. 05

    Picked Mistral 7B v0.3 over Llama 3.1 8B and Phi-3.5-mini after a written decision matrix on VRAM, license, code pretraining, and instruction template stability.

  6. 06

    Adapter merged into base weights and exported in safetensors. Loads into vLLM with no extra flags. Released under Apache 2.0 on HuggingFace.

Benchmark results

87.0%

Exact set match

internal test, 500 examples

9.2%

Baseline EM

base Mistral, same split

97.4%

Valid SQL (schema-aware)

vs 86.2% baseline

28.19%

WikiSQL execution accuracy

15,878 examples

8.25 GB

Peak VRAM

on 12 GB RTX 3080 Ti

Focus areas

QLoRAFine-tuningbitsandbytesPEFTText-to-SQLEvaluation harnessesvLLM