2026 · Fine-tuning · Text-to-SQL
SQLForge: Mistral 7B QLoRA
QLoRA fine-tune of Mistral 7B v0.3 on text-to-SQL. Exact set match goes from 9 percent to 87 percent on the internal test split, trained on a single RTX 3080 Ti.
Why I built this
After pretraining Phoenix 125M from scratch, the natural next question was: can the same hardware do something practically useful? Text-to-SQL was the right target. The base Mistral 7B knows SQL syntax from The Stack pretraining but is unreliable on real schemas. QLoRA on 178K examples was small enough to fit, large enough to matter, and the eval signal is concrete (does the query return the right rows). The schema-aware eval rebuild was the most important part. The first WikiSQL run reported 0 percent valid SQL because the validator was checking string equality instead of execution. Fixing that turned a panic moment into a 97.4 percent result.
Architecture
Model selection
- Mistral 7B v0.3 · Apache 2.0, no gating
- ~8.25 GB peak VRAM · 3.75 GB headroom on 12 GB
- Code-heavy pretraining · The Stack v1 + GitHub
- Trivial chat template · clean loss masking on answers
- vLLM ready · no extra flags for serving
- Largest delta · capable base → big eval lift
VRAM budget on 12 GB
-
3.60
Model weights
4-bit NF4 quantisation with double quantisation. ~22 GB BF16 base compressed into 3.6 GB.
-
0.03
LoRA adapters
Rank 64 across 7 target modules (q, k, v, o, gate, up, down) in BF16. Trainable parameter count is tiny.
-
0.12
AdamW optimiser state
FP32 momentum and variance for adapter weights only. The frozen base model contributes nothing here.
-
3.00
Activations
Batch size 1, sequence length 512, gradient checkpointing on. Recompute trades compute for memory.
-
1.50
CUDA kernels and fragmentation
Overhead for cuBLAS, bitsandbytes kernels, and allocator fragmentation. Buffered to stay safe.
-
8.25
Total estimate
3.75 GB headroom on a 12 GB RTX 3080 Ti. Allows future sequence-length and rank increases without OOM.
Data and recipe
- Primary · b-mc2/sql-create-context (78,577 examples, Apache 2.0)
- Augmentation · gretelai/synthetic_text_to_sql (100K, Apache 2.0)
- Quantisation · 4-bit NF4 with double quant via bitsandbytes
- LoRA · rank 64, alpha 16, dropout 0.05, 7 target modules
- Loss masking · DataCollatorForCompletionOnlyLM on the [/INST] split
- Schedule · cosine LR with warmup, gradient checkpointing on
Evaluation harness
Three benchmarks: a 500-example internal test split (exact set match plus schema-aware valid SQL), WikiSQL test set with true execution accuracy against table rows, and Spider 1.0 dev for community comparability. The first WikiSQL run reported 0 percent valid SQL because the validator was checking string equality; rewriting it to actually run the queries against table schemas turned the metric honest and revealed a real 97.4 percent valid rate.
Serving
Adapter merged back into the base weights and re-exported in safetensors format. Loads cleanly into vLLM with no custom code, ready for batched throughput serving. Apache 2.0 license, published on HuggingFace alongside the evaluation results.
Tech stack
Technologies used
core
data
tools
Key highlights
Proof points
- 01
Exact set match jumped from 9.2 percent (base Mistral 7B v0.3) to 87.0 percent on the 500-example internal test split. A 77.8 percentage point lift from a single fine-tune run.
- 02
Schema-aware valid SQL hit 97.4 percent. The original string-match validator reported 0 percent; rewriting it to execute queries against table schemas turned a panic into a real result.
- 03
Trained on a single RTX 3080 Ti with ~8.25 GB peak VRAM. The 3.75 GB headroom budgeted up front meant zero OOM during a 1,173-second eval run on 500 examples.
- 04
WikiSQL execution accuracy of 28.19 percent on 15,878 test examples, evaluated by actually running the predicted SQL against the WikiSQL table rows.
- 05
Picked Mistral 7B v0.3 over Llama 3.1 8B and Phi-3.5-mini after a written decision matrix on VRAM, license, code pretraining, and instruction template stability.
- 06
Adapter merged into base weights and exported in safetensors. Loads into vLLM with no extra flags. Released under Apache 2.0 on HuggingFace.
Benchmark results
Exact set match
internal test, 500 examples
Baseline EM
base Mistral, same split
Valid SQL (schema-aware)
vs 86.2% baseline
WikiSQL execution accuracy
15,878 examples
Peak VRAM
on 12 GB RTX 3080 Ti
Focus areas
Explore the work