2026 · Decoder-only language model

Phoenix 125M

A 125M-parameter LLM trained end-to-end on a single RTX 3080 Ti, from corpus to checkpoint.

Released 2026

Why I built this

I wanted to learn how LLMs actually work — not from an API, but by building one end-to-end. With a single 3080 Ti, training a 125M model from scratch was the right scope: ambitious enough to be real, small enough to be feasible on consumer hardware. Every decision — architecture, data pipeline, training stability — was a lesson I could not have learned any other way.

125M Parameters
~2B Tokens trained
0.507 WinoGrande score

Architecture

Data Pipeline — 7 stages

Download from HuggingFace and AIKosh → PDF/HTML text extraction via PyMuPDF and BeautifulSoup → XLM-RoBERTa language detection and PII redaction → MinHash LSH deduplication → corpus mixing by source ratio → BPE tokenizer training (32K vocab) → uint16 binary shard output for memory-mapped loading.
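The final stage works because a 32K vocabulary fits in 16 bits: every token ID is below 65,536, so a shard can be a flat array of uint16 values that the training loop maps into memory instead of loading whole. Below is a minimal sketch of that idea using only the standard library. The function names and the headerless shard layout are illustrative assumptions, not the project's actual code, and a real loader would more likely use numpy.memmap.

```python
import mmap
from array import array

UINT16_MAX = 2**16 - 1  # largest token ID a 2-byte slot can hold


def write_shard(path, token_ids):
    """Write token IDs as a flat native-endian uint16 file (hypothetical layout)."""
    ids = list(token_ids)
    if any(t < 0 or t > UINT16_MAX for t in ids):
        raise ValueError("token ID does not fit in uint16")
    with open(path, "wb") as f:
        array("H", ids).tofile(f)


def read_window(path, start, length):
    """Return `length` token IDs starting at token offset `start`.

    The file is memory-mapped, so only the pages backing the requested
    window need to be read from disk.
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Views must be released before the map closes, hence the `with`.
            with memoryview(mm) as raw, raw.cast("H") as tokens:
                return list(tokens[start:start + length])
```

At two bytes per token, a corpus of roughly 2B tokens takes about 4 GB on disk, which is why mapping windows on demand is attractive on a machine with limited RAM.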

Model

Decoder-only transformer: RoPE positional encoding, SwiGLU activations, RMSNorm (pre-norm), PyTorch 2.x FlashAttention, and weight-tied input/output embeddings. 125M parameters across 12 layers, 12 heads, d_model=768, bf16 precision.
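As a rough check on how these dimensions add up, the sketch below counts weights for a bias-free decoder with tied embeddings. The feed-forward hidden size `d_ff` and the exact vocabulary size (32,000 here) are not stated above, so both are assumptions, as is the absence of bias terms. RoPE contributes no learned parameters.

```python
def count_params(vocab_size, d_model, n_layers, d_ff, tied_embeddings=True):
    """Count weights for a bias-free decoder with SwiGLU and RMSNorm."""
    embedding = vocab_size * d_model
    attention = 4 * d_model * d_model   # Q, K, V and output projections
    swiglu = 3 * d_model * d_ff         # gate, up and down projections
    norms = 2 * d_model                 # two RMSNorm gain vectors per layer
    per_layer = attention + swiglu + norms
    total = embedding + n_layers * per_layer + d_model  # + final RMSNorm
    if not tied_embeddings:
        total += embedding              # separate output projection
    return total


# d_ff values are assumed candidates; the write-up does not state d_ff.
for d_ff in (2048, 3072):
    print(d_ff, f"{count_params(32_000, 768, 12, d_ff) / 1e6:.1f}M")
```

Under these assumptions the total ranges from about 109.5M (`d_ff=2048`) to 137.8M (`d_ff=3072`), so the stated 125M is consistent with a hidden size between those two. Weight tying saves one full embedding matrix, about 24.6M parameters at this vocabulary size.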

Training Loop

bf16 mixed precision, gradient checkpointing to fit 12 GB VRAM, cosine LR schedule with warmup, gradient clipping at 1.0, AdamW optimizer. Checkpoints every 1K steps to NAS via SMB.
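The schedule can be written as a pure function of the step number. This is a minimal sketch of linear warmup followed by cosine decay. The peak and floor learning rates and the step counts used in the example are placeholder assumptions, since the actual hyperparameters are not given here.

```python
import math


def lr_at(step, max_lr, min_lr, warmup_steps, total_steps):
    """Learning rate for a 0-indexed optimizer step."""
    if step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # falls from 1 to 0
    return min_lr + (max_lr - min_lr) * cosine


# Placeholder values, not the project's settings.
schedule = dict(max_lr=6e-4, min_lr=6e-5, warmup_steps=100, total_steps=1100)
print([lr_at(s, **schedule) for s in (0, 100, 600, 1100)])
```

In PyTorch, a function like this can drive `torch.optim.lr_scheduler.LambdaLR` by returning `lr_at(step, ...) / max_lr` as the multiplicative factor.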

Evaluation Harness

Perplexity on WikiText-103. Zero-shot benchmarks implemented from scratch: HellaSwag (normalized), WinoGrande, ARC-Easy, LAMBADA accuracy, PIQA substitute — matching lm-evaluation-harness conventions.
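Both metric families reduce to simple arithmetic on log-probabilities. The sketch below shows perplexity as the exponential of the mean negative log-likelihood, and a length-normalized multiple-choice pick of the kind used for normalized HellaSwag accuracy. It assumes normalization by the character length of each candidate ending (token-count normalization is a common alternative). The function names and example numbers are illustrative, not taken from the project.

```python
import math


def perplexity(logprobs):
    """exp of the mean negative log-likelihood (natural-log inputs)."""
    return math.exp(-sum(logprobs) / len(logprobs))


def pick_choice(endings, logprob_sums, normalize=True):
    """Index of the best-scoring ending.

    `logprob_sums[i]` is the summed log-probability the model assigns to
    `endings[i]` given the context. With normalize=True each sum is divided
    by the ending's character length, so long endings are not penalized
    just for containing more tokens.
    """
    scores = [
        lp / len(text) if normalize else lp
        for text, lp in zip(endings, logprob_sums)
    ]
    return max(range(len(scores)), key=scores.__getitem__)


# Made-up log-probabilities for two candidate endings.
endings = ["a dog", "a very large dog indeed"]
sums = [-6.0, -12.0]
raw_pick = pick_choice(endings, sums, normalize=False)  # favors the short ending
norm_pick = pick_choice(endings, sums)                  # favors the long ending
```

Read the other way, a perplexity of 928.9 corresponds to an average loss of about 6.83 nats per unit of text over which the perplexity is normalized.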

HuggingFace Export

Custom PhoenixForCausalLM class registered with HuggingFace Auto classes. Full tokenizer, model card with benchmark results, and inference examples. Apache 2.0 license.

Tech stack

Technologies used

core

PyTorch 2.x · HuggingFace Transformers · BPE Tokenizer (32K) · FlashAttention · RoPE · SwiGLU · RMSNorm

infra

RTX 3080 Ti (12 GB VRAM) · NAS via SMB · Ray (distributed preprocessing)

tools

MinHash LSH dedup · XLM-RoBERTa (lang detect + PII) · PyMuPDF · BeautifulSoup

Key highlights

Proof points

  1. Trained a 125M-parameter LLM from scratch on a single consumer GPU, with no cloud compute and no distributed training.

  2. Built a 7-stage data pipeline processing ~2B tokens from Wikipedia, C4-en, OpenWebText2, Project Gutenberg, StackExchange, and ArXiv.

  3. WinoGrande score of 0.507, marginally above random chance (0.50). A margin this small is within sampling noise, so it is best read as near-chance performance rather than evidence that the model captures commonsense structure.

  4. Released under Apache 2.0 on HuggingFace with full model card, tokenizer, and inference examples.

  5. What's next: fine-tuning Mistral-7B for SQL generation, applying the same evaluation discipline to instruction following.

Benchmark results

0.507 WinoGrande (chance = 0.50)
0.279 HellaSwag (1K samples)
0.358 ARC-Easy (570 samples)
928.9 WikiText-103 PPL (lower = better)
0.003 LAMBADA accuracy (long-range hard at 125M)
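To judge how far each accuracy sits from guessing, it helps to express the gap in standard errors. The sketch below uses a simple binomial approximation. The HellaSwag and ARC-Easy sample counts come from the figures above, while the WinoGrande count (1,267, the size of its usual validation split) and the four-option chance rate of 0.25 for ARC-Easy are assumptions.

```python
import math


def z_vs_chance(accuracy, chance, n_examples):
    """Number of standard errors between an observed accuracy and chance."""
    std_err = math.sqrt(chance * (1.0 - chance) / n_examples)
    return (accuracy - chance) / std_err


# (accuracy, chance rate, number of examples)
results = {
    "WinoGrande": (0.507, 0.50, 1267),  # n is an assumption, not from the write-up
    "HellaSwag": (0.279, 0.25, 1000),   # four endings per item
    "ARC-Easy": (0.358, 0.25, 570),     # assumes four answer options per question
}
z_scores = {name: z_vs_chance(*row) for name, row in results.items()}
for name, z in z_scores.items():
    print(f"{name}: {z:.1f} standard errors above chance")
```

Under these assumptions WinoGrande sits about half a standard error above chance, HellaSwag about two, and ARC-Easy about six, so ARC-Easy is the clearest signal of learning among the accuracy benchmarks.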

Focus areas

PyTorch · Transformers · Tokenization · Benchmarking · Distributed training