2026 · Decoder-only language model
Phoenix 125M
A 125M parameter LLM trained end-to-end on a single RTX 3080 Ti — corpus to checkpoint.
Why I built this
I wanted to learn how LLMs actually work — not from an API, but by building one end-to-end. With a single 3080 Ti, training a 125M model from scratch was the right scope: ambitious enough to be real, small enough to be feasible on consumer hardware. Every decision — architecture, data pipeline, training stability — was a lesson I could not have learned any other way.
Architecture
Data Pipeline — 7 stages
Download from HuggingFace and AIKosh → PDF/HTML text extraction via PyMuPDF and BeautifulSoup → XLM-RoBERTa language detection and PII redaction → MinHash LSH deduplication → corpus mixing by source ratio → BPE tokenizer training (32K vocab) → uint16 binary shard output for memory-mapped loading.
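The final stage works because a 32K vocabulary fits in 16 bits: every token id is below 65,536, so each token costs two bytes on disk, and a shard can be memory-mapped and sliced without loading it whole. Below is a minimal standard-library sketch of that idea. The function names, the exact vocabulary size of 32,000, and the use of native byte order are assumptions for illustration, not details taken from the project.

```python
import mmap
from array import array

VOCAB_SIZE = 32_000  # assumption: "32K vocab" read as exactly 32,000


def write_shard(path, token_ids):
    """Write token ids as raw unsigned 16-bit integers (2 bytes per token)."""
    if any(t < 0 or t >= VOCAB_SIZE for t in token_ids):
        raise ValueError("token id outside the vocabulary range")
    buf = array("H", token_ids)  # "H" = unsigned 16-bit, native byte order
    with open(path, "wb") as f:
        buf.tofile(f)


def read_window(path, start, length):
    """Memory-map a shard and copy out `length` tokens starting at `start`."""
    with open(path, "rb") as f:
        # The views are released in reverse order before the map is closed.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm, \
                memoryview(mm) as raw, raw.cast("H") as tokens:
            return tokens[start:start + length].tolist()
```

A training loader would typically call something like `read_window` with a random `start` and a `length` of one context window plus one token, so that inputs and shifted targets come from the same slice.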
Model
Decoder-only transformer: RoPE positional encoding, SwiGLU activations, RMSNorm (pre-norm), PyTorch 2.x FlashAttention, and weight-tied input/output embeddings. 125M parameters across 12 layers, 12 heads, d_model=768, bf16 precision.
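As a rough sanity check on the 125M figure, the parameter count can be estimated from the stated dimensions. The sketch below assumes a vocabulary of exactly 32,000, bias-free linear layers, and a SwiGLU hidden width `d_ff` of 2,560. The hidden width is not given in the text; it is a hypothetical value chosen only because it lands near 125M. RoPE contributes no learned parameters, and weight tying means the embedding matrix is counted once.

```python
def count_params(vocab=32_000, d_model=768, n_layers=12, d_ff=2_560, tied=True):
    """Estimate parameters of a pre-norm decoder-only transformer.

    d_ff is an assumed value; the source does not state the MLP width.
    """
    embed = vocab * d_model              # token embeddings
    head = 0 if tied else vocab * d_model  # output projection (shared if tied)
    attn = 4 * d_model * d_model         # Q, K, V, and output projections
    mlp = 3 * d_model * d_ff             # SwiGLU: gate, up, and down projections
    norms = 2 * d_model                  # two RMSNorm gain vectors per block
    final_norm = d_model                 # RMSNorm before the output head
    return embed + head + n_layers * (attn + mlp + norms) + final_norm


if __name__ == "__main__":
    print(f"{count_params() / 1e6:.1f}M parameters")
```

Under these assumptions the estimate comes to roughly 124M. Untying the embeddings would add another 24.6M, which shows how much of a small model's budget the vocabulary takes up.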
Training Loop
bf16 mixed precision, gradient checkpointing to fit 12 GB VRAM, cosine LR schedule with warmup, gradient clipping at 1.0, AdamW optimizer. Checkpoints every 1K steps to NAS via SMB.
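A cosine schedule with warmup can be written as a pure function of the step number. The sketch below shows one common form: linear warmup to a peak, then cosine decay to a floor. Every numeric default here (peak and floor learning rates, warmup length, total steps) is a hypothetical placeholder, since the text does not give the values used.

```python
import math


def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=1_000, total_steps=50_000):
    """Learning rate for a 0-indexed optimizer step (all defaults are assumed)."""
    if step < warmup_steps:
        # Linear warmup: reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    if step >= total_steps:
        return min_lr
    # Fraction of the decay phase completed, in [0, 1).
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # goes from 1 to 0
    return min_lr + (max_lr - min_lr) * cosine
```

In a PyTorch loop, a function like this would usually be applied by setting `param_group["lr"]` on the optimizer before each step; gradient clipping then happens between the backward pass and the optimizer step.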
Evaluation Harness
Perplexity on WikiText-103, plus zero-shot benchmarks implemented from scratch: HellaSwag (length-normalized accuracy), WinoGrande, ARC-Easy, LAMBADA accuracy, and a PIQA substitute, all following lm-evaluation-harness conventions.
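Both kinds of metric reduce to arithmetic on per-token log-probabilities. The sketch below assumes the model has already produced those log-probabilities; the function names are hypothetical. It computes token-level perplexity, while harnesses often report word-level or byte-level perplexity, so figures are only comparable when the unit matches. For multiple choice, it normalizes by the character length of each continuation, which is my understanding of the `acc_norm` convention rather than a detail confirmed by the text.

```python
import math


def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood per token (natural log)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def pick_choice(choice_logprobs, choice_texts, normalize=True):
    """Return the index of the best-scoring continuation.

    choice_logprobs[i] is the summed log-probability of continuation i
    given the shared context. With normalize=True each sum is divided by
    the continuation's character length, so longer endings are not
    penalized just for containing more tokens.
    """
    scores = []
    for logprob, text in zip(choice_logprobs, choice_texts):
        scores.append(logprob / len(text) if normalize else logprob)
    return max(range(len(scores)), key=scores.__getitem__)
```

Accuracy is then the fraction of examples where the returned index matches the gold label. The two modes can disagree: a long ending with a lower raw score can win once the score is divided by its length.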
HuggingFace Export
Custom PhoenixForCausalLM class registered with HuggingFace Auto classes. Full tokenizer, model card with benchmark results, and inference examples. Apache 2.0 license.
Key highlights
Proof points
- 01
Trained a 125M parameter LLM from scratch on a single consumer GPU — no cloud, no distributed training.
- 02
Built a 7-stage data pipeline processing ~2B tokens from Wikipedia, C4-en, OpenWebText2, Project Gutenberg, StackExchange, and ArXiv.
- 03
WinoGrande score of 0.507, just above random chance (0.50). A margin this small may fall within sampling noise, so it is at most a faint sign that the model captures basic commonsense structure.
- 04
Released under Apache 2.0 on HuggingFace with full model card, tokenizer, and inference examples.
- 05
What's next: fine-tuning Mistral-7B for SQL generation, applying the same evaluation discipline to instruction following.
Benchmark results
WinoGrande (chance = 0.50)
HellaSwag (1K samples)
ARC-Easy (570 samples)
WikiText-103 PPL (lower = better)
LAMBADA accuracy (long-range prediction is hard at 125M)
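With evaluation subsets of 1K and 570 samples, sampling error is large enough to matter when reading these scores. The sketch below uses the standard binomial approximation to estimate the standard error of an accuracy and to express a score's distance from chance in standard errors. The function names are hypothetical, and the WinoGrande sample count is not stated on this page, so any `n_samples` passed for it is an assumption.

```python
import math


def accuracy_stderr(accuracy, n_samples):
    """Binomial standard error of an accuracy measured on n_samples items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_samples)


def z_vs_chance(accuracy, n_samples, chance):
    """Distance of a score from chance, in standard errors under the chance rate."""
    return (accuracy - chance) / accuracy_stderr(chance, n_samples)


if __name__ == "__main__":
    for n in (1_000, 570):
        print(n, round(accuracy_stderr(0.5, n), 4))
```

At 1,000 samples the standard error near 50% accuracy is about 1.6 points, and at 570 samples it is about 2.1 points, so differences smaller than that between runs or checkpoints should not be read as real improvements.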