2026 · Decoder-only language model

Phoenix 125M

A 125M-parameter LLM trained end-to-end on a single RTX 3080 Ti, from corpus to checkpoint.

Released 2026

Why I built this

I wanted to learn how LLMs actually work, not from an API, but by building one end-to-end. With a single 3080 Ti, training a 125M model from scratch was the right scope: ambitious enough to be real, small enough to be feasible on consumer hardware. Every decision, from architecture to data pipeline to training stability, was a lesson I could not have learned any other way.

125M Parameters

~2B Tokens trained

0.507 WinoGrande score

Architecture

Data Pipeline: 7 stages

01

Download

Fetch from HuggingFace and AIKosh. Validate sources and write to raw corpus directory.

HuggingFace · AIKosh

02

Extract

PDF text via PyMuPDF, HTML via BeautifulSoup. Structured text out, noise stripped.

PyMuPDF · BeautifulSoup

03

Detect and Redact

XLM-RoBERTa language detection filters non-English. PII patterns redacted before any training.

XLM-RoBERTa · PII redaction

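
To make the redaction step concrete, here is a minimal sketch assuming simple regex patterns for email addresses and phone numbers. The pattern set, the placeholder tokens, and the function name redact_pii are hypothetical; this page does not list which PII categories the pipeline covers or how it matches them.

```python
import re

# Hypothetical pattern set for illustration; the real pipeline's PII
# categories and matching rules are not documented on this page.
PII_PATTERNS = [
    ("<EMAIL>", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("<PHONE>", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def redact_pii(text: str) -> str:
    """Replace every match of each pattern with its placeholder token."""
    for placeholder, pattern in PII_PATTERNS:
        text = pattern.sub(placeholder, text)
    return text
```

Running redaction before deduplication and tokenization keeps raw identifiers out of every later artifact, including the tokenizer's training data.
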
04

Deduplicate

MinHash LSH deduplication across the full corpus by n-gram signature. Near-duplicates removed.

MinHash LSH

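
The idea behind MinHash LSH can be sketched in pure Python. This is an illustration under assumptions, not the pipeline's code: the word 3-gram shingles, the 64-value signature, and the 16-band split are all hypothetical choices, and a production run would more likely use a dedicated library.

```python
import hashlib
from collections import defaultdict

NUM_PERM = 64  # signature length (assumed value)
BANDS = 16     # LSH bands, giving 64 / 16 = 4 rows per band (assumed value)


def shingles(text: str, n: int = 3) -> set[str]:
    """Word n-grams of the lowercased text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(max(1, len(words) - n + 1))}


def _hash(seed: int, item: str) -> int:
    """Seeded 64-bit hash; each seed stands in for one random permutation."""
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8,
                             salt=seed.to_bytes(8, "little")).digest()
    return int.from_bytes(digest, "little")


def minhash(text: str) -> tuple[int, ...]:
    """Signature: the minimum hash of the shingle set under each seed."""
    grams = shingles(text)
    return tuple(min(_hash(seed, g) for g in grams) for seed in range(NUM_PERM))


def estimated_jaccard(a: tuple[int, ...], b: tuple[int, ...]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def candidate_pairs(docs: dict[str, str]) -> set[tuple[str, str]]:
    """Documents sharing at least one identical band become candidate duplicates."""
    rows = NUM_PERM // BANDS
    buckets = defaultdict(list)
    for doc_id, text in docs.items():
        sig = minhash(text)
        for b in range(BANDS):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs
```

Banding is what avoids comparing every pair of documents: only documents that collide in some band are checked, and estimated_jaccard can then confirm a candidate pair against a similarity threshold before one copy is dropped.
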
05

Mix

Corpus blended by source ratio: Wikipedia, C4-en, OpenWebText2, Gutenberg, StackExchange, ArXiv.

source mixing · domain balance

06

Tokenize

BPE tokenizer trained on the mixed corpus. 32K vocab, byte-level fallback, special tokens added.

BPE 32K · byte-level

07

Shard

uint16 binary shard output for memory-mapped loading during training. No full corpus in RAM.

uint16 shards · memory-mapped
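
A 32K vocabulary fits in uint16 (maximum 65,535), which is why two bytes per token are enough. The sketch below shows the write and memory-mapped read path using only the standard library. The function names and the flat, header-free shard layout are assumptions; numpy.memmap with dtype uint16 is the more common choice in training code.

```python
import mmap
from array import array


def write_shard(path: str, token_ids: list[int]) -> None:
    """Write token ids as raw native-endian uint16 values, no header (assumed layout)."""
    buf = array("H", token_ids)  # raises OverflowError for ids above 65535
    with open(path, "wb") as f:
        buf.tofile(f)


def read_window(path: str, start: int, length: int) -> list[int]:
    """Read `length` tokens starting at token index `start` without loading the file."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Reinterpret the mapped bytes as unsigned 16-bit integers.
            with memoryview(mm) as raw, raw.cast("H") as view:
                return view[start:start + length].tolist()
```

Only the pages touched by the requested window are read from disk, so a training loop can sample random context windows from shards far larger than RAM.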

Model

RoPE · rotary positional encoding
SwiGLU · gated activation, no ReLU
RMSNorm · pre-norm, no bias
FlashAttention · PyTorch 2.x kernel
12L / 12H / d768 · 125M parameters
bf16 · weight-tied in/out embeddings
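
As a rough check on the headline size, the parameter count for this layout can be worked out by hand. The sketch assumes a vocabulary of exactly 32,000, bias-free linear layers, two RMSNorm gain vectors per block plus a final one, and a SwiGLU hidden size d_ff that this page does not state. RoPE adds no learned parameters.

```python
def count_params(n_layers: int = 12, d_model: int = 768, vocab_size: int = 32_000,
                 d_ff: int = 2048, tied: bool = True) -> int:
    """Parameter count for a pre-norm decoder; d_ff and vocab_size are assumptions."""
    embed = vocab_size * d_model        # token embedding, shared with the output head if tied
    attn = 4 * d_model * d_model        # Q, K, V, and output projections, no bias
    ffn = 3 * d_model * d_ff            # SwiGLU: gate, up, and down projections
    norms = 2 * d_model                 # two RMSNorm gain vectors per block
    total = embed + n_layers * (attn + ffn + norms) + d_model  # plus the final RMSNorm
    if not tied:
        total += vocab_size * d_model   # separate output projection
    return total
```

Under these assumptions, d_ff = 2048 gives about 109.5M parameters and d_ff = 3072 gives about 137.8M, so a 125M total is consistent with a hidden size between the two.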

Training Loop

Precision · bf16 mixed precision throughout
Memory · gradient checkpointing to fit 12 GB VRAM
Schedule · cosine LR with linear warmup
Stability · gradient clipping at 1.0, AdamW optimizer
Checkpoints · every 1,000 steps to NAS via SMB
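
The schedule line above can be sketched as a pure function of the step number. The peak and floor learning rates, the warmup length, and the total step count are placeholder values, not the ones used for this run.

```python
import math


def lr_at(step: int, max_lr: float = 3e-4, min_lr: float = 3e-5,
          warmup_steps: int = 1_000, total_steps: int = 30_000) -> float:
    """Linear warmup to max_lr, then cosine decay to min_lr (placeholder defaults)."""
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + cosine * (max_lr - min_lr)
```

Keeping the schedule stateless means a run resumed from a checkpoint recovers the correct learning rate from the step counter alone.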

Evaluation Harness

Perplexity · WikiText-103 held-out set
HellaSwag · normalized, zero-shot
WinoGrande · commonsense reasoning
ARC-Easy · science QA, 570 samples
LAMBADA · long-range accuracy
PIQA · physical intuition substitute

All benchmarks implemented from scratch, matching lm-evaluation-harness conventions
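
Two calculations sit at the core of a harness like this: perplexity from per-token log-probabilities, and length-normalized scoring for multiple-choice tasks. The sketch below assumes natural-log probabilities already produced by the model. The function names are hypothetical, and the unit used for normalization (characters, bytes, or tokens) is left to the caller.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_choice(choice_logprobs: list[float], choice_lengths: list[int],
                normalize: bool = True) -> int:
    """Index of the best-scoring continuation.

    choice_logprobs[i] is the summed log-probability of continuation i given
    the context; choice_lengths[i] is its length in the chosen unit.
    """
    scores = [lp / n if normalize else lp
              for lp, n in zip(choice_logprobs, choice_lengths)]
    return max(range(len(scores)), key=scores.__getitem__)
```

Normalization matters because a summed log-probability always penalizes longer continuations; dividing by length removes that bias, so the normalized and raw pick can differ on the same item.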

HuggingFace Export

Custom PhoenixForCausalLM class registered with HuggingFace Auto classes. Full tokenizer, model card with benchmark results, and inference examples. Apache 2.0 license.

Tech stack

Technologies used

core

PyTorch 2.x · HuggingFace Transformers · BPE Tokenizer (32K) · FlashAttention · RoPE · SwiGLU · RMSNorm

infra

RTX 3080 Ti (12 GB VRAM) · NAS via SMB · Ray (distributed preprocessing)

tools

MinHash LSH dedup · XLM-RoBERTa (lang detect + PII) · PyMuPDF · BeautifulSoup · MLflow + W&B (experiment tracking) · DVC (data versioning)

Key highlights

Proof points

01
Trained a 125M-parameter LLM from scratch on a single consumer GPU. No cloud, no distributed training.
02
Built a 7-stage data pipeline processing ~2B tokens from Wikipedia, C4-en, OpenWebText2, Project Gutenberg, StackExchange, and ArXiv.
03
WinoGrande score of 0.507: marginally above random chance (0.50). The margin is small, so it is a weak signal rather than firm evidence of commonsense reasoning.
04
Released under Apache 2.0 on HuggingFace with full model card, tokenizer, and inference examples.
05
Set up the next project: a QLoRA fine-tune of Mistral 7B on text-to-SQL with the same evaluation discipline.

Benchmark results

WinoGrande · 0.507 · chance = 0.50
HellaSwag · 0.279 · 1K samples
ARC-Easy · 0.358 · 570 samples
WikiText-103 PPL · 928.9 · lower = better
LAMBADA accuracy · 0.003 · long-range hard at 125M

Focus areas

PyTorch · Transformers · Tokenization · Benchmarking · Distributed training · MLflow · Weights & Biases · DVC
