Question:
What’s the most powerful AI model you can train on a MacBook Pro in just five minutes?
Short answer:
The best I managed was a ~1.8M-parameter GPT-style transformer, trained on ~20M TinyStories tokens. It reached a perplexity of ~9.6 on a held-out split.
Example output (prompt in bold):
**Once upon a time**, there was a little boy named Tim. Tim had a small box that he liked to play with. He would push the box to open. One day, he found a big red ball in his yard. Tim was so happy. He picked it up and showed it to his friend, Jane. “Look at my bag! I need it!” she said. They played with the ball all day and had a great time.
Not exactly Shakespeare, but not bad for five minutes.
The Challenge
This was mostly a fun, curiosity-driven experiment — and maybe a little silly — for two reasons:
- If you can afford a MacBook Pro, you could just rent 30 minutes on an H100 GPU and train something vastly stronger.
- If you’re stuck with a laptop, there’s no real reason to limit training to five minutes.
That said, constraints breed creativity. The goal: train the best possible language model in just five minutes of compute time.
Key Limitation: Tokens per Second
Five minutes isn’t long enough to push many tokens through a model, so:
- Large models are out — they’re too slow per token.
- Tiny models train quickly, but can’t learn much.
It’s a balancing act: better to train a 1M-parameter model on millions of tokens than a billion-parameter model on a few thousand.
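To make that balancing act concrete, here's a quick back-of-envelope sketch. The throughput figures below are made-up placeholders rather than measurements; the point is how fast the token budget shrinks as the model grows:

```python
# Rough token budget for a 5-minute run at different model sizes.
# Throughput numbers are illustrative placeholders, not measurements.
BUDGET_SECONDS = 5 * 60

throughput = {        # hypothetical tokens/sec on a laptop GPU
    "1M params": 60_000,
    "10M params": 15_000,
    "100M params": 2_000,
}

for size, tok_per_sec in throughput.items():
    total = tok_per_sec * BUDGET_SECONDS
    print(f"{size:>11}: ~{total / 1e6:.1f}M tokens in five minutes")
```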
Performance Optimization
Initial transformer training on Apple’s MPS backend hit ~3,000 tokens/sec.
Surprisingly:
- torch.compile, float16, and other math tweaks didn’t help.
- Gradient accumulation made things slower (launch overhead was the real bottleneck).
- Switching from PyTorch to MLX gave no meaningful boost.
Best practices for this scale (a minimal training-loop sketch follows this list):
- Use MPS
- Skip compilation/quantization
- Avoid gradient accumulation
- Keep the model small
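As a rough illustration of those practices, here's a minimal PyTorch training-loop sketch: plain fp32 on MPS, no torch.compile, no gradient accumulation, one optimizer step per batch. The tiny stand-in model and the random token batches are placeholders, not the code behind the actual runs:

```python
import time

import torch
import torch.nn as nn

# Plain fp32 on MPS, no compile, no accumulation: one step per batch.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

vocab_size, seq_len, batch_size = 4096, 256, 32
model = nn.Sequential(                      # stand-in for the small GPT sketched later
    nn.Embedding(vocab_size, 128),
    nn.Linear(128, vocab_size),
).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

start, tokens_seen = time.time(), 0
while time.time() - start < 5 * 60:         # stop when the 5-minute budget runs out
    x = torch.randint(vocab_size, (batch_size, seq_len), device=device)
    y = torch.randint(vocab_size, (batch_size, seq_len), device=device)
    loss = loss_fn(model(x).view(-1, vocab_size), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()                              # one optimizer step per batch, no accumulation
    tokens_seen += batch_size * seq_len

print(f"{tokens_seen / (time.time() - start):,.0f} tokens/sec")
```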
Choosing the Right Dataset
With ~10M tokens (~50MB text), dataset choice matters.
- Simple English Wikipedia was an okay start, but output was fact-heavy and noun-obsessed.
- TinyStories (short, synthetic stories pitched at a 4-year-old's reading level) worked far better:
  - Coherent narratives
  - Cause-and-effect logic
  - Minimal proper nouns
  - Simple grammar
Perfect for small language models.
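For anyone who wants to reproduce the setup, here's a minimal sketch of pulling TinyStories with the Hugging Face datasets library. The dataset ID (roneneldan/TinyStories) and the text column name are the ones commonly used on the Hub, and the 50MB cap is just the rough corpus size mentioned above:

```python
# Sketch: download TinyStories and keep roughly 50MB of raw text.
# Assumes the Hub ID "roneneldan/TinyStories" with a "text" column.
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories", split="train")

corpus, size = [], 0
for row in ds:
    corpus.append(row["text"])
    size += len(row["text"])
    if size > 50_000_000:       # ~50MB of text, on the order of 10M tokens
        break

print(f"{len(corpus)} stories, ~{size / 1e6:.0f}MB of text")
```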
Tokenization
Tokenizer training wasn’t counted in the five-minute budget. At this scale:
- Tokenization overhead is negligible.
- Multi-byte tokens are easier for small models to learn than raw characters.
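Here's a minimal sketch of that one-off tokenizer step, using the Hugging Face tokenizers library to train a small byte-level BPE vocabulary. The vocabulary size and the tinystories.txt path are illustrative assumptions, not the exact setup from the experiment:

```python
# Train a small byte-level BPE tokenizer once, outside the 5-minute budget.
# Vocab size and input file are illustrative.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=4096, special_tokens=["<|endoftext|>"])
tokenizer.train(["tinystories.txt"], trainer)

sample = "Once upon a time, there was a little boy."
ids = tokenizer.encode(sample).ids
print(f"{len(sample)} characters -> {len(ids)} tokens")
```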
Architecture Experiments
Transformers
- A GPT-2-style decoder was the default choice (a minimal sketch follows this list).
- SwiGLU activation gave a boost.
- 2–3 layers worked best.
- Learning rate: 0.001–0.002 was optimal for fast convergence.
- Learned positional embeddings outperformed RoPE.
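Putting those pieces together, here's a minimal sketch of that kind of model: a two-layer GPT-style decoder with learned positional embeddings and a SwiGLU feed-forward block. The widths and head counts are illustrative, not the exact configuration behind the ~1.8M-parameter result:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden)
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Block(nn.Module):
    """Pre-norm decoder block: causal self-attention + SwiGLU MLP."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ff = SwiGLU(dim, 4 * dim)

    def forward(self, x, mask):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a
        return x + self.ff(self.ln2(x))

class TinyGPT(nn.Module):
    def __init__(self, vocab=4096, dim=128, n_heads=4, n_layers=2, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)   # learned positions, not RoPE
        self.blocks = nn.ModuleList([Block(dim, n_heads) for _ in range(n_layers)])
        self.ln = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, vocab, bias=False)

    def forward(self, idx):                     # idx: (batch, seq_len) token ids
        T = idx.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=idx.device), 1)
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        for blk in self.blocks:
            x = blk(x, mask)
        return self.head(self.ln(x))            # (batch, seq_len, vocab) logits

model = TinyGPT()
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```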
LSTMs
- A similar overall structure (embedding, stacked recurrent layers, output head), but slightly worse perplexity than the transformer.
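For comparison, here's what that LSTM baseline can look like, keeping the same embedding-and-head arrangement and swapping the attention blocks for a recurrent core. Sizes are again illustrative:

```python
import torch.nn as nn

# Minimal LSTM language model: embedding -> stacked LSTM -> vocab projection.
class TinyLSTMLM(nn.Module):
    def __init__(self, vocab=4096, dim=128, n_layers=2):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=n_layers, batch_first=True)
        self.head = nn.Linear(dim, vocab, bias=False)

    def forward(self, idx):                 # idx: (batch, seq_len) token ids
        x, _ = self.lstm(self.tok(idx))     # (batch, seq_len, dim)
        return self.head(x)                 # (batch, seq_len, vocab) logits
```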
Diffusion Models
- Tried D3PM language diffusion — results were unusable, producing random tokens.
- Transformers & LSTMs reached grammatical output within a minute; diffusion didn’t.
Finding the Sweet Spot in Model Size
Experimenting with sizes revealed:
- ~2M parameters was the upper practical limit.
- Any bigger: too slow to converge in 5 minutes.
- Any smaller: plateaued too early.

This lined up with the Chinchilla scaling laws, which relate optimal model size to training tokens.
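The commonly cited Chinchilla rule of thumb is roughly 20 training tokens per parameter. A quick check against this experiment's approximate numbers (the exact ratio depends on details not pinned down here):

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
params = 1.8e6        # ~1.8M-parameter model
tokens = 20e6         # ~20M TinyStories tokens seen in five minutes

print(f"Chinchilla-optimal tokens for this size: ~{20 * params / 1e6:.0f}M")  # ~36M
print(f"Tokens per parameter actually seen:      ~{tokens / params:.0f}")     # ~11
```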
Final Thoughts
This experiment won’t change the future of AI training — most interesting behavior happens after five minutes. But it was:
- A great way to explore tiny-model training dynamics
- A fun test of laptop GPU capabilities
- Proof that you can get a coherent storytelling model in five minutes
With better architectures and faster consumer GPUs, we might eventually see surprisingly capable models trained in minutes — right from a laptop.