Question:
What’s the most powerful AI model you can train on a MacBook Pro in just five minutes?
Short answer:
The best I managed was a ~1.8M-parameter GPT-style transformer, trained on ~20M TinyStories tokens. It reached a perplexity of ~9.6 on a held-out split.
Example output (prompt in bold):
**Once upon a time**, there was a little boy named Tim. Tim had a small box that he liked to play with. He would push the box to open. One day, he found a big red ball in his yard. Tim was so happy. He picked it up and showed it to his friend, Jane. “Look at my bag! I need it!” she said. They played with the ball all day and had a great time.
Not exactly Shakespeare, but not bad for five minutes.
The Challenge
This was mostly a fun, curiosity-driven experiment — and maybe a little silly — for two reasons:
- If you can afford a MacBook Pro, you could just rent 30 minutes on an H100 GPU and train something vastly stronger.
- If you’re stuck with a laptop, there’s no real reason to limit training to five minutes.
That said, constraints breed creativity. The goal: train the best possible language model in just five minutes of compute time.
Key Limitation: Tokens per Second
Five minutes isn’t long enough to push many tokens through a model, so:
- Large models are out — they’re too slow per token.
- Tiny models train quickly, but can’t learn much.
It’s a balancing act: better to train a 1M-parameter model on millions of tokens than a billion-parameter model on a few thousand.
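To make that balancing act concrete, here's a quick back-of-envelope sketch. The throughput figures below are made-up placeholders rather than measurements; the point is how fast the token budget shrinks as the model grows:

```python
# Rough token budget for a 5-minute run at different model sizes.
# Throughput numbers are illustrative placeholders, not measurements.
BUDGET_SECONDS = 5 * 60

throughput = {        # hypothetical tokens/sec on a laptop GPU
    "1M params": 60_000,
    "10M params": 15_000,
    "100M params": 2_000,
}

for size, tok_per_sec in throughput.items():
    total = tok_per_sec * BUDGET_SECONDS
    print(f"{size:>11}: ~{total / 1e6:.1f}M tokens in five minutes")
```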
Performance Optimization
Initial transformer training on Apple’s MPS backend hit ~3,000 tokens/sec.
Surprisingly:
- torch.compile, float16, and other math tweaks didn’t help.
- Gradient accumulation made things slower (launch overhead was the real bottleneck).
- Switching from PyTorch to MLX gave no meaningful boost.
Best practices for this scale (a minimal training-loop sketch follows this list):
- Use MPS
- Skip compilation/quantization
- Avoid gradient accumulation
- Keep the model small
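As a rough illustration of those practices, here's a minimal PyTorch training-loop sketch: plain fp32 on MPS, no torch.compile, no gradient accumulation, one optimizer step per batch. The tiny stand-in model and the random token batches are placeholders, not the code behind the actual runs:

```python
import time

import torch
import torch.nn as nn

# Plain fp32 on MPS, no compile, no accumulation: one step per batch.
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")

vocab_size, seq_len, batch_size = 4096, 256, 32
model = nn.Sequential(                      # stand-in for the small GPT sketched later
    nn.Embedding(vocab_size, 128),
    nn.Linear(128, vocab_size),
).to(device)
opt = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

start, tokens_seen = time.time(), 0
while time.time() - start < 5 * 60:         # stop when the 5-minute budget runs out
    x = torch.randint(vocab_size, (batch_size, seq_len), device=device)
    y = torch.randint(vocab_size, (batch_size, seq_len), device=device)
    loss = loss_fn(model(x).view(-1, vocab_size), y.view(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()                              # one optimizer step per batch, no accumulation
    tokens_seen += batch_size * seq_len

print(f"{tokens_seen / (time.time() - start):,.0f} tokens/sec")
```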
Choosing the Right Dataset
With ~10M tokens (~50MB text), dataset choice matters.
- Simple English Wikipedia was an okay start, but output was fact-heavy and noun-obsessed.
- TinyStories (short, synthetic stories pitched at a 4-year-old's reading level) worked far better:
  - Coherent narratives
  - Cause-and-effect logic
  - Minimal proper nouns
  - Simple grammar
Perfect for small language models.
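For anyone who wants to reproduce the setup, here's a minimal sketch of pulling TinyStories with the Hugging Face datasets library. The dataset ID (roneneldan/TinyStories) and the text column name are the ones commonly used on the Hub, and the 50MB cap is just the rough corpus size mentioned above:

```python
# Sketch: download TinyStories and keep roughly 50MB of raw text.
# Assumes the Hub ID "roneneldan/TinyStories" with a "text" column.
from datasets import load_dataset

ds = load_dataset("roneneldan/TinyStories", split="train")

corpus, size = [], 0
for row in ds:
    corpus.append(row["text"])
    size += len(row["text"])
    if size > 50_000_000:       # ~50MB of text, on the order of 10M tokens
        break

print(f"{len(corpus)} stories, ~{size / 1e6:.0f}MB of text")
```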
Tokenization
Tokenizer training wasn’t counted in the five-minute budget. At this scale:
- Tokenization overhead is negligible.
- Multi-byte tokens are easier for small models to learn than raw characters.
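Here's a minimal sketch of that one-off tokenizer step, using the Hugging Face tokenizers library to train a small byte-level BPE vocabulary. The vocabulary size and the tinystories.txt path are illustrative assumptions, not the exact setup from the experiment:

```python
# Train a small byte-level BPE tokenizer once, outside the 5-minute budget.
# Vocab size and input file are illustrative.
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()

trainer = trainers.BpeTrainer(vocab_size=4096, special_tokens=["<|endoftext|>"])
tokenizer.train(["tinystories.txt"], trainer)

sample = "Once upon a time, there was a little boy."
ids = tokenizer.encode(sample).ids
print(f"{len(sample)} characters -> {len(ids)} tokens")
```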
Architecture Experiments
Transformers
- A GPT-2-style decoder was the default choice (a minimal sketch follows this list).
- SwiGLU activation gave a boost.
- 2–3 layers worked best.
- Learning rate: 0.001–0.002 was optimal for fast convergence.
- Learned positional embeddings outperformed RoPE.
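Putting those pieces together, here's a minimal sketch of that kind of model: a two-layer GPT-style decoder with learned positional embeddings and a SwiGLU feed-forward block. The widths and head counts are illustrative, not the exact configuration behind the ~1.8M-parameter result:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """SwiGLU feed-forward block: down(silu(gate(x)) * up(x))."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.gate = nn.Linear(dim, hidden)
        self.up = nn.Linear(dim, hidden)
        self.down = nn.Linear(hidden, dim)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class Block(nn.Module):
    """Pre-norm decoder block: causal self-attention + SwiGLU MLP."""
    def __init__(self, dim, n_heads):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.ff = SwiGLU(dim, 4 * dim)

    def forward(self, x, mask):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, attn_mask=mask, need_weights=False)
        x = x + a
        return x + self.ff(self.ln2(x))

class TinyGPT(nn.Module):
    def __init__(self, vocab=4096, dim=128, n_heads=4, n_layers=2, max_len=256):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)   # learned positions, not RoPE
        self.blocks = nn.ModuleList([Block(dim, n_heads) for _ in range(n_layers)])
        self.ln = nn.LayerNorm(dim)
        self.head = nn.Linear(dim, vocab, bias=False)

    def forward(self, idx):                     # idx: (batch, seq_len) token ids
        T = idx.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=idx.device), 1)
        x = self.tok(idx) + self.pos(torch.arange(T, device=idx.device))
        for blk in self.blocks:
            x = blk(x, mask)
        return self.head(self.ln(x))            # (batch, seq_len, vocab) logits

model = TinyGPT()
print(f"{sum(p.numel() for p in model.parameters()) / 1e6:.1f}M parameters")
```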
LSTMs
- A similar overall structure (embedding, stacked recurrent layers, output head), but slightly worse perplexity than the transformer.
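For comparison, here's what that LSTM baseline can look like, keeping the same embedding-and-head arrangement and swapping the attention blocks for a recurrent core. Sizes are again illustrative:

```python
import torch.nn as nn

# Minimal LSTM language model: embedding -> stacked LSTM -> vocab projection.
class TinyLSTMLM(nn.Module):
    def __init__(self, vocab=4096, dim=128, n_layers=2):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.lstm = nn.LSTM(dim, dim, num_layers=n_layers, batch_first=True)
        self.head = nn.Linear(dim, vocab, bias=False)

    def forward(self, idx):                 # idx: (batch, seq_len) token ids
        x, _ = self.lstm(self.tok(idx))     # (batch, seq_len, dim)
        return self.head(x)                 # (batch, seq_len, vocab) logits
```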
Diffusion Models
- Tried D3PM language diffusion — results were unusable, producing random tokens.
- Transformers & LSTMs reached grammatical output within a minute; diffusion didn’t.
Finding the Sweet Spot in Model Size
Experimenting with sizes revealed:
- ~2M parameters was the upper practical limit.
- Any bigger: too slow to converge in 5 minutes.
- Any smaller: plateaued too early.

This lined up with the Chinchilla scaling laws, which relate optimal model size to training tokens.
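The commonly cited Chinchilla rule of thumb is roughly 20 training tokens per parameter. A quick check against this experiment's approximate numbers (the exact ratio depends on details not pinned down here):

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter.
params = 1.8e6        # ~1.8M-parameter model
tokens = 20e6         # ~20M TinyStories tokens seen in five minutes

print(f"Chinchilla-optimal tokens for this size: ~{20 * params / 1e6:.0f}M")  # ~36M
print(f"Tokens per parameter actually seen:      ~{tokens / params:.0f}")     # ~11
```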
Final Thoughts
This experiment won’t change the future of AI training — most interesting behavior happens after five minutes. But it was:
- A great way to explore tiny-model training dynamics
- A fun test of laptop GPU capabilities
- Proof that you can get a coherent storytelling model in five minutes
With better architectures and faster consumer GPUs, we might eventually see surprisingly capable models trained in minutes — right from a laptop.